Demo #1: JPEG XS: Low-latency, Error-robust Video Transport
Presenter: Thomas Richter, Siegfried Fößel (Fraunhofer IIS)
Abstract: In this demo, we will show real-time video capture and recording with JPEG XS using forward error protection following SMPTE 2022-5. JPEG XS is a low-latency, low-complexity video codec whose first and second editions have been standardized by ISO as ISO/IEC 21122, and whose third edition, adding temporal differential coding, is currently under standardization. In this demo, a mini-PC will compress a live signal from a camera in real time and transport the JPEG XS codestream following IETF RFC 9134. While forward error correction schemes are currently not part of any JPEG XS-related standard, an error protection mechanism following SMPTE 2022-5 will protect the stream in this demo from packet loss and burst errors. An error simulator, triggered by visitors of the demo, can be used to stress the stream so that visitors can observe the impact of errors with and without the additional error correction streams present. The decoder, also implemented in software on a mini-PC, will decode the signal and render it on a monitor in real time. If requested and considered useful by the program committee, a talk on the workings of JPEG XS, its standardization progress and SMPTE 2022-5 can be prepared.
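The forward error correction used here rests on XOR parity: a parity packet computed over a group of media packets lets the receiver reconstruct any single lost packet in that group. A minimal sketch of the idea, with packet grouping and layout deliberately simplified (this is not the actual SMPTE 2022-5 wire format):

```python
def xor_parity(packets):
    """Compute an XOR parity packet over a group of equal-length payloads."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Reconstruct the single missing packet (marked None) in a group."""
    missing = bytearray(parity)
    for pkt in received:
        if pkt is not None:
            for i, b in enumerate(pkt):
                missing[i] ^= b
    return bytes(missing)

group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = xor_parity(group)
# Simulate loss of packet 2 and recover it from the parity packet.
lossy = [group[0], group[1], None, group[3]]
assert recover(lossy, parity) == b"pkt2"
```

Because a single parity packet only repairs one loss per group, schemes in this family arrange packets into rows and columns with parity over each, which is what provides the burst-error resilience shown in the demo.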

Demo #6: Media over QUIC: Initial Testing, Findings and Results
Presenter: Zafer Gurel, Ozyegin University (Türkiye), Tugce Erkilic Civelek, Ozyegin University (Türkiye), Ali C. Begen, Ozyegin University (Türkiye) and Comcast (USA), Alex Giladi, Comcast (USA)
Abstract: Over-the-top (OTT) live sports has taken sports broadcasting to a new level, where fans can stream their favorite games on connected devices. However, there are still challenges that need to be tackled. Nobody wants to hear a neighbor’s cheers when a goal is scored before seeing it on the screen, making low-latency transport and playback indispensable. Synchronization among all the viewing devices and social media feeds is also essential. The existing HTTP ecosystem comprises solid foundational components, such as distributed caches, efficient client applications and high-performance server software, glued together with HTTP. This formation allows efficient live media delivery at scale. However, the two popular approaches, DASH and HLS, are highly tuned for HTTP/1.1 and 2 running on top of TCP. The downside is the latency caused by the head-of-line (HoL) blocking experienced due to TCP’s in-order and reliable delivery. The latest version of HTTP (HTTP/3) uses QUIC underneath instead of TCP. QUIC can carry different media types or parts in different streams. These streams can be multiplexed over a single connection, avoiding HoL blocking. The streams can also be prioritized (or even discarded) based on specific media properties (e.g., dependency structure and presentation timestamp) to trade off reliability against latency. DASH and HLS can readily run over HTTP/3, but they can only reap the benefits if they use its unique features. The IETF recently formed a new working group to develop a QUIC-based low-latency delivery solution for media ingest and distribution in browser and non-browser environments. Targeted use cases include live streaming, cloud gaming, remote desktop, videoconferencing and eSports. The work is still in its infancy, but we believe media over QUIC running in an HTTP/3 or WebTransport environment could potentially be a game-changer.
This demo, among the first, presents the architectural design issues and preliminary results from the early prototypes.
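The HoL-blocking difference described above can be illustrated with a toy delivery model: when one packet is lost and retransmitted, a single in-order TCP stream delays every subsequent frame behind it, while independent QUIC streams deliver unaffected frames on time. A hypothetical sketch (the timings and retransmission delay are illustrative, not measurements from the prototypes):

```python
def delivery_times(arrivals, lost, rtx_delay, in_order):
    """Return per-frame delivery times under a toy loss model.

    arrivals:  nominal arrival time of each frame's packet
    lost:      set of frame indices whose first packet is lost
    rtx_delay: extra time needed to retransmit a lost packet
    in_order:  True models one reliable, in-order TCP byte stream
               (HoL blocking); False models independent QUIC streams.
    """
    done = []
    for i, t in enumerate(arrivals):
        t_i = t + (rtx_delay if i in lost else 0)
        if in_order and done:
            # A frame cannot be delivered before its predecessors.
            t_i = max(t_i, done[-1])
        done.append(t_i)
    return done

arrivals = [10, 20, 30, 40]
tcp = delivery_times(arrivals, {0}, 100, in_order=True)    # all frames wait
quic = delivery_times(arrivals, {0}, 100, in_order=False)  # only frame 0 waits
```

In this model the TCP case delivers every frame at time 110, whereas the multi-stream case delays only the frame that was actually lost, which is the latency benefit the abstract attributes to QUIC.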

Demo #7: Camera-based Gaze Tracker Driven Robotic and Assisted Living/Hospital Bed Use-cases
Presenter: Mithun B S,Tince Varghese, Aditya Choudhary, Rahul Dasharath Gavas, Ramesh Kumar Ramakrishnan, Arpan Pal (TCS Research, India)
Abstract: In recent years, hands-free control of robots, hospital beds and senior-assistance platforms using various modalities, such as handheld devices and physiological sensors, has been gaining wide acceptance. Gaze tracking is a compelling alternative owing to its non-obtrusiveness and the fact that it keeps the hands free. Healthcare applications and human-robot use cases can benefit greatly from gaze. To meet this goal, we first developed an RGB camera-based gaze tracker that can be easily deployed on a standard computer for daily usage. We deploy this gaze tracker in two use cases, viz., controlling a robot’s movements and an assisted living/hospital bed scenario. In the robotic use case, our primary goal is to control a robot’s movements using gaze tracking rather than handheld devices, to free the hands for other important tasks. In this regard, we provide the user with a display screen showing the robot’s field of view. The user gazes at the locations where he/she wants the robot to move, and the locomotion of the robot is controlled accordingly. The system has been tested exclusively in our lab, and we found that users can easily and effectively control the robot’s navigation in any direction using the proposed system. This can be used in a hospital scenario, wherein a doctor who is not physically present near a patient can attend via the robot, assess the patient’s health and interact with them, while keeping his/her hands free to make notes or perform other important actions; similarly, in telerobotics, the operator’s hands remain free to manipulate the robot arms while steering the robot with the eye tracker.
In the second use case, we intend to provide an assistive living platform for bedridden patients or elderly people using an interface showing options for ordering food or snacks, calling a doctor or nurse, or even controlling the bed to adjust the height or the angle of the head or leg portions. We conducted an in-house pilot study to assess the ease and effectiveness of controlling this system compared with conventional input devices such as a mouse. It was seen that the time taken and the effectiveness of usage via gaze tracking as the input modality were in line with those of conventional input devices.
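The abstract does not detail how a gaze point on the robot's field-of-view display translates into a motion command; one simple scheme divides the screen into directional regions with a central dead zone. This is a hypothetical sketch of such a controller, not the authors' actual method:

```python
def gaze_to_command(x, y, width, height, dead_zone=0.2):
    """Map a gaze point (in screen pixels) to a robot command by region.

    A central dead zone (as a fraction of the half-screen) keeps the
    robot still; otherwise the dominant horizontal or vertical offset
    from the screen centre picks the direction of motion.
    """
    dx = (x - width / 2) / (width / 2)    # normalized to -1 .. 1
    dy = (y - height / 2) / (height / 2)  # normalized to -1 .. 1
    if abs(dx) < dead_zone and abs(dy) < dead_zone:
        return "stop"
    if abs(dx) >= abs(dy):
        return "turn_right" if dx > 0 else "turn_left"
    return "backward" if dy > 0 else "forward"

assert gaze_to_command(960, 540, 1920, 1080) == "stop"        # screen centre
assert gaze_to_command(1900, 540, 1920, 1080) == "turn_right" # right edge
assert gaze_to_command(960, 10, 1920, 1080) == "forward"      # top edge
```

A dead zone of this kind is a common remedy for unintended motion from gaze jitter; dwell-time confirmation would be a natural addition before issuing a command.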

Demo #8: VVC in a large-scale streaming environment
Presenter: Kevin Rocard (Bitmovin), Jacob Arends (Bitmovin), Adam Wieckowski (Fraunhofer HHI), Benjamin Bross (Fraunhofer HHI)
Abstract: Versatile Video Coding (VVC) is the latest video coding standard, released in mid-2020 as the successor to the High Efficiency Video Coding (HEVC) standard. VVC has been developed to provide up to 50% bit-rate savings compared to HEVC for the same perceived video quality. Unlike previous video coding standards, VVC already includes in its first version specific coding tools and systems functionalities for a wide range of applications. For video streaming, VVC offers several benefits, such as efficient streaming of demanding content, including UHD (8K+) and 360° video, lower distribution costs due to the lower bitrate, and high visual quality maintained even on slower networks. This demo shows how VVC performs in an actual end-to-end video streaming environment at low bitrates. In the past, Bitmovin and Fraunhofer HHI already demonstrated VVC cloud encoding and playback in a browser. The videos have been encoded with the Bitmovin cloud transcoding solution using VVenC, an open VVC encoder from Fraunhofer HHI. This has been further evolved using a so-called smart chunking approach. After encoding, the assets are made available using Dynamic Adaptive Streaming over HTTP (DASH). On the player side, the open VVC decoder VVdeC has been integrated into an internal version of the Bitmovin Android demo app to evaluate VVC playback on a range of streams and devices. To test the VVC low-bitrate performance, the same HD assets have been encoded in both VVC and HEVC at the same low bitrates below 500 kb/s. Upon playback on a Samsung Galaxy S8 Android tablet, the performance and image quality of the streams can be compared.
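On the player side, a DASH client picks a rendition against the measured throughput; at identical ladder bitrates, VVC's roughly 50% saving means the same rung carries noticeably higher quality. A generic rate-selection sketch (not Bitmovin player code; the ladder values are illustrative):

```python
def select_rendition(ladder_kbps, throughput_kbps, safety=0.8):
    """Pick the highest-bitrate rendition that fits the throughput estimate.

    safety discounts the estimate to leave headroom for bandwidth
    variance; falls back to the lowest rung when nothing fits.
    """
    budget = throughput_kbps * safety
    fitting = [b for b in ladder_kbps if b <= budget]
    return max(fitting) if fitting else min(ladder_kbps)

ladder = [150, 300, 500]  # illustrative low-bitrate HD ladder, kb/s
assert select_rendition(ladder, 700) == 500  # 560 kb/s budget fits the top rung
assert select_rendition(ladder, 450) == 300  # 360 kb/s budget drops one rung
assert select_rendition(ladder, 100) == 150  # below the ladder: lowest rung
```

Under such a policy, a viewer stuck on the 300 kb/s rung would see VVC quality comparable to what HEVC delivers at roughly twice that bitrate, which is the comparison the demo makes visible side by side.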

VVenC Software Repository on Github:
VVdeC Software Repository on Github:
Bitmovin’s Smart Chunking:
VVC – Its Benefits, Supported Devices, and How Bitmovin is Implementing it:

Demo #9: A Real-time Chinese Food Auto-Billing System Based on Synthetic Images
Presenter: Qiushi Guo, Yifan Chen, Jin Ma, Tengteng Zhang (China Merchants Bank)
Abstract: In recent years, the topic of food segmentation has gained significant attention in both academic and industrial circles. Various solutions have been proposed for the segmentation of Western food, demonstrating promising performance that aligns with the requirements of applications such as diet management and calorie estimation. Motivated by these accomplishments, we have undertaken the design of an automatic billing system for Chinese food prices based on instance segmentation methods. However, the segmentation of Chinese food poses a formidable challenge due to the extensive range of ingredients and cooking styles involved. It is infeasible to amass a sufficiently large image dataset that encompasses all potential variations of Chinese cuisine for training a segmentation model. To address this challenge, rather than attempting to detect individual dishes, we have reformulated the task by focusing on segmenting a curated selection of plates containing Chinese food. In this regard, we introduce a FoodSyn module, which employs image synthesis techniques by extracting food portions from the UECFoodPIX dataset and seamlessly integrating them into plate images. The resulting synthesized images are then utilized for training an encoder-decoder network to perform instance segmentation. Extensive experimentation has demonstrated the efficacy of our proposed approach in practical scenarios, achieving a mean Intersection over Union (mIoU) exceeding 95% and a final price estimation accuracy of over 99%, at a processing rate surpassing 20 frames per second (fps). We intend to release the source code once the paper detailing our research is accepted for publication.
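The core of the synthesis step described for FoodSyn is compositing a masked food crop onto a plate image. A minimal sketch of that operation, assuming soft masks in [0, 1] (the actual FoodSyn pipeline and its blending details are not published):

```python
import numpy as np

def paste_food(plate, food, mask, top, left):
    """Alpha-composite a masked food crop onto a plate image.

    plate: HxWx3 uint8 background; food: hxwx3 uint8 crop;
    mask:  hxw float in [0, 1]; (top, left) is the paste position.
    """
    out = plate.astype(np.float32).copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    a = mask[..., None]  # broadcast the mask over the colour channels
    out[top:top + h, left:left + w] = a * food + (1 - a) * region
    return out.astype(np.uint8)

plate = np.zeros((8, 8, 3), np.uint8)          # toy empty plate image
food = np.full((4, 4, 3), 200, np.uint8)        # toy food crop
mask = np.ones((4, 4), np.float32)              # fully opaque crop
img = paste_food(plate, food, mask, 2, 2)
assert img[3, 3].tolist() == [200, 200, 200]    # inside the pasted food
assert img[0, 0].tolist() == [0, 0, 0]          # plate untouched outside
```

Soft (feathered) mask edges rather than the hard mask used here are what make such composites look seamless enough to train a segmentation network on.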

Demo #10: Instant Object Registration System for Image Recognition of Retail Products
Presenter: Tomokazu Kaneko, Soma Shiraishi, Makoto Terao (NEC Corporation)
Abstract: We propose an efficient product registration system for image recognition AIs in retail stores. The system automatically generates a product image dataset for training AIs. The user only needs to shoot a video of the product being held and rotated in hand for 10-20 seconds. The proposed system approaches the problem of extending the set of recognized products not through methods such as few-shot learning but from the perspective of improving the efficiency of the registration process. An image dataset of each product is required to train a classification model of retail products. However, hundreds of new products are introduced daily in the retail domain; therefore, an efficient dataset generation system is required to keep the dataset up to date. The conventional process of creating an image dataset requires tedious tasks, including taking pictures, manual annotation, and cleansing image sets. At the start of this process, we take images of the products from various angles using a turntable. This requires special equipment, such as a turntable, and the shooting environment must be set up so that other products do not appear in the background. Next, we manually annotate the location of the products in the captured images and cut out the product images. At the same time, we remove images unsuitable for training due to factors such as motion blur or defocus. These manual annotation processes are expensive and take about 30 minutes per product. The proposed system generates an image dataset from a video of the product moved by hand. The system estimates the product position in the video and cuts out images. At this localization step, the system focuses on the moving area in the video, taking advantage of the condition that the product is moved by hand. This method allows the system to isolate only the product area to be registered, even in unknown background environments where other objects may appear.
Furthermore, the system is equipped with a function to detect blur and occlusion of the product caused by hand movements. This function allows the automatic removal of images inappropriate for the dataset. The proposed system requires only an RGB camera and a PC or a mobile device; therefore, neither special equipment nor a particular environment is needed. With the proposed system, the registration process, which used to take 50 hours for 100 products, can be reduced to 30 minutes, allowing on-site registration. In the demonstration, we will present two versions of the proposed system, one for PCs and the other for smartphones. The PC version consists of a laptop PC and a camera to capture the video. The smartphone version requires a smartphone only. We also prepare some products and a small shelf displaying them to show the registration process. Attendees will experience the process of creating a dataset by picking up the products and taking videos of them using the system.
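The localization step exploits the fact that the held product is the moving part of the scene. A simplified stand-in for that idea, using plain frame differencing to bound the moving region (the authors' actual method is not published and is presumably more robust):

```python
import numpy as np

def moving_bbox(prev, curr, thresh=25):
    """Bounding box of the moving region between two grayscale frames.

    Pixels whose absolute intensity difference exceeds thresh are
    treated as belonging to the moved product. Returns a box as
    (top, left, bottom, right), or None if nothing moved.
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
    ys, xs = np.nonzero(diff)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1

prev = np.zeros((10, 10), np.uint8)   # toy static background
curr = prev.copy()
curr[3:6, 4:8] = 255                  # the product moved into this area
assert moving_bbox(prev, curr) == (3, 4, 6, 8)
assert moving_bbox(prev, prev) is None
```

In practice such a detector would be combined with the blur/occlusion filter the abstract describes, e.g. rejecting crops whose sharpness falls below a threshold, so that only clean views enter the dataset.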