Tutorials

February 17, 2017

Schedule

Monday, 10 July

1) Jay Kuo, USC, USA, “Deep Learning for Big Visual Data Analytics”
Time: 8:30am – 11:30am
Room: Grand Ballroom I

2) Ivan Tashev, Microsoft Research, USA, “Spatial Audio for Virtual and Augmented Reality Devices”
Time: 1:30pm – 4:30pm
Room: Grand Ballroom II

3) Yuming Fang, Jiangxi University of Finance and Economics, China, and Jiwen Lu, Tsinghua University, China, “Visual Content Perception and Understanding”
Time: 1:30pm – 4:30pm
Room: Salon III

Friday, 14 July

1) Ali C. Begen, Comcast, USA and Ozyegin University, Turkey, and Christian Timmerer, Alpen-Adria-Universität Klagenfurt, Austria, and Bitmovin, “Adaptive Streaming of Traditional and Omnidirectional Media over HTML5”
Time: 8:30am – 11:30am
Room: Grand Ballroom I

2) Zhenzhong Chen, Wuhan University, and Jizheng Xu, Microsoft, “Future Video Coding: Beyond H.265/HEVC”
Time: 1:30pm – 4:30pm
Room: Grand Salon

3) Daniel P. K. Lun, The Hong Kong Polytechnic University, Hong Kong, and Lap-Pui Chau, Nanyang Technological University, Singapore, “Image-Based Three-Dimensional Data Acquisition”
Time: 1:30pm – 4:30pm
Room: Salon III

Spatial Audio for Virtual and Augmented Reality Devices

Abstract
Virtual and augmented reality devices are a hot topic for research and product development. In most cases, these devices come in the form of head-mounted displays. Several audio subsystems are integral parts of these devices: spatial audio rendering, capture of the user’s voice, and capture of the environmental audio.

Unlike vision, where humans have a field of view of approximately 90°, human hearing covers all directions in all three dimensions. This means that the spatial audio system of these devices is expected to provide realistic rendering of sound objects in full 3D to complement the stereoscopic rendering of the visual objects.

In this tutorial, we will discuss the problems and potential solutions around the spatial audio subsystem in virtual and augmented reality devices. These include personalization of head-related transfer functions (HRTFs) and generation of proper reverberation and distance cues. Most of these problems and solutions have broader scope and affect everyday scenarios such as listening to stereo music with headphones and watching movies on mobile devices.
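
For readers who want a concrete picture of the rendering step, the following minimal sketch renders a mono source binaurally by convolving it with a left/right pair of head-related impulse responses (HRIRs, the time-domain form of HRTFs). The 1/r gain standing in for a distance cue and the function interface are illustrative assumptions, not the tutorial's actual pipeline:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right, distance_m=1.0):
    """Render a mono signal from one direction by HRIR convolution.

    mono:            1-D array, the dry source signal
    hrir_left/right: 1-D arrays (equal length), HRIRs measured for the
                     source direction
    distance_m:      source distance; a crude 1/r gain stands in for a
                     real distance model (placeholder assumption)
    """
    gain = 1.0 / max(distance_m, 0.1)           # simple distance cue
    left = gain * np.convolve(mono, hrir_left)
    right = gain * np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)      # (2, N) binaural output
```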

Speaker
Dr. Ivan Tashev received his Master’s degree in Electronic Engineering in 1984 and his PhD in Computer Science in 1990 from the Technical University of Sofia, Bulgaria. He was an Assistant Professor in the Department of Electronic Engineering of the same university until 1998, when he moved to Microsoft in Redmond, USA. Currently Dr. Tashev is a Partner Software Architect and leads the Audio and Acoustics Research Group in Microsoft Research Labs in Redmond, USA. He has been a senior member of the IEEE and a member of the Audio Engineering Society since 2006, and serves on the IEEE SPS Audio and Acoustic Signal Processing Technical Committee and as an associate member of the IEEE SPS Industrial DSP Standing Committee. Since 2012 he has been an adjunct professor in the Department of Electrical Engineering of the University of Washington in Seattle, USA.

Dr. Tashev has published two scientific books as sole author, written chapters in two other books, and authored or co-authored more than 70 publications in scientific journals and conferences. He is listed as an inventor on 50 US patent applications, 31 of them already granted. The audio processing technologies created by Dr. Tashev have been incorporated in Microsoft Windows, the Microsoft Auto Platform, and the Microsoft RoundTable device. Dr. Tashev served as the leading audio architect for Kinect for Xbox and Microsoft HoloLens. More details about him can be found on his web page: https://www.microsoft.com/en-us/research/people/ivantash/.

Adaptive Streaming of Traditional and Omnidirectional Media

Abstract
This tutorial consists of three main parts. In the first part, we provide a detailed overview of the HTML5 standard and show how it can be used for adaptive streaming deployments. In particular, we focus on HTML5 video and media extensions as well as multi-bitrate encoding, encapsulation, and encryption workflows, and we survey well-established streaming solutions. Furthermore, we present experiences from existing deployments and the relevant de jure and de facto standards (DASH, HLS, CMAF) in this space. In the second part, we focus on omnidirectional (360°) media from creation to consumption. We survey means for the acquisition, projection, coding, and packaging of omnidirectional media, as well as delivery, decoding, and rendering methods. Emerging standards and industry practices are covered as well. The last part presents some of the current research trends, open issues that need further exploration and investigation, and various efforts that are underway in the streaming industry.
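
To give a flavor of the adaptation logic behind such deployments, here is a minimal sketch of the throughput-based bitrate selection a DASH-style player might perform. The bitrate ladder and safety margin are illustrative assumptions, not prescribed by any of the standards above:

```python
def select_bitrate(throughput_bps, representations, safety=0.8):
    """Pick the highest representation sustainable at the measured throughput.

    representations: available bitrates in bps (for DASH, listed in the MPD)
    safety:          margin to absorb throughput variance (assumed value)
    """
    budget = throughput_bps * safety
    candidates = [r for r in sorted(representations) if r <= budget]
    return candidates[-1] if candidates else min(representations)

# Example: 4 Mbit/s measured, ladder of 1/2.5/5/8 Mbit/s -> picks 2.5 Mbit/s
print(select_bitrate(4_000_000, [1_000_000, 2_500_000, 5_000_000, 8_000_000]))
```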

Target Audience and Prerequisite Knowledge
This tutorial includes both introductory and advanced-level information. The audience is expected to have an understanding of basic video coding and IP networking principles. Researchers, developers, and content and service providers are all welcome.

Table of Contents

  • Part I: The HTML5 Standard and Adaptive Streaming
    • HTML5 video and media extensions
    • Survey of well-established streaming solutions
    • Multi-bitrate encoding, encapsulation, and encryption workflows
    • The MPEG-DASH standard, Apple HLS and the developing CMAF standard
  • Part II: Omnidirectional (360°) Media
    • Acquisition, projection, coding and packaging of 360° video
    • Delivery, decoding and rendering methods
    • The developing MPEG-OMAF and MPEG-I standards
  • Part III: Open Issues and Future Directions
    • Common issues in scaling and improving quality, multi-screen/hybrid delivery
    • Ongoing industry efforts

Speakers
Ali C. Begen recently joined the computer science department at Ozyegin University, Turkey. Previously, he was a research and development engineer at Cisco, where he architected, designed, and developed algorithms, protocols, products, and solutions in the service provider and enterprise video domains. Currently, in addition to teaching and research, he provides consulting services to industrial, legal, and academic institutions through Networked Media, a company he co-founded. Begen holds a Ph.D. degree in electrical and computer engineering from Georgia Tech. He has received a number of scholarly and industry awards, and he holds editorial positions in prestigious magazines and journals in the field. He is a senior member of the IEEE and a senior member of the ACM. In January 2016, he was elected a distinguished lecturer by the IEEE Communications Society. Further information on his projects, publications, talks, teaching, and standards and professional activities can be found at http://ali.begen.net.

Christian Timmerer received his M.Sc. (Dipl.-Ing.) in January 2003 and his Ph.D. (Dr.techn.) in June 2006 (for research on the adaptation of scalable multimedia content in streaming and constrained environments), both from the Alpen-Adria-Universität (AAU) Klagenfurt. He joined the AAU in 1999 (as a system administrator) and is currently an Associate Professor at the Institute of Information Technology (ITEC) within the Multimedia Communication Group. His research interests include immersive multimedia communications, streaming, adaptation, Quality of Experience, and Sensory Experience. He was the general chair of WIAMIS 2008, QoMEX 2013, and MMSys 2016 and has participated in several EC-funded projects, notably DANAE, ENTHRONE, P2P-Next, ALICANTE, SocialSensor, COST IC1003 QUALINET, and ICoSOLE. He also participated in ISO/MPEG work for several years, notably in the area of MPEG-21, MPEG-M, MPEG-V, and MPEG-DASH, where he also served as standard editor. In 2012 he co-founded Bitmovin (http://www.bitmovin.com/) to provide professional services around MPEG-DASH, where he holds the position of Chief Innovation Officer (CIO).

Future Video Coding: Beyond H.265/HEVC

Abstract
The H.265/HEVC standard, finalized in January 2013, is the latest video coding standard; it was developed by the Joint Collaborative Team on Video Coding (JCT-VC), formed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Compared to its predecessor, H.264/AVC, H.265/HEVC achieves significant bit-rate savings thanks to new coding technologies. Since then, extensions of H.265/HEVC, such as range extensions, scalability, multi-view, 3D video coding, and screen content coding, have been developed for further applications. In addition, new coding technologies are being developed through VCEG-KTA, VP10, AVS2, IEEE 1857, JVET, 360° video coding, etc. This tutorial focuses on these newly developed technologies, as well as recent advances in perceptual video coding, alongside the H.265/HEVC standard and its extensions: RExt (range extensions), SHVC (scalable extension), MV-HEVC (multiview extension), and SCC (screen content coding extension). Finally, the talk concludes with a summary and a discussion of future possibilities and challenges.
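
One building block shared by all of these codecs, and central to the coding optimization part of the outline below, is Lagrangian rate-distortion optimization: among candidate coding modes, the encoder picks the one minimizing J = D + λR. The sketch below illustrates this decision rule with made-up candidate numbers; it is not taken from any reference encoder:

```python
def best_mode(candidates, lam):
    """Lagrangian mode decision: minimize J = D + lambda * R.

    candidates: list of (mode_name, distortion, rate_bits) tuples,
                e.g. obtained by trial-encoding each mode (illustrative)
    lam:        Lagrange multiplier trading distortion against rate
    """
    return min(candidates, key=lambda m: m[1] + lam * m[2])

# Example: intra vs. inter vs. skip for one block (made-up numbers)
modes = [("intra", 120.0, 96), ("inter", 90.0, 64), ("skip", 150.0, 2)]
print(best_mode(modes, lam=0.85))   # -> ('inter', 90.0, 64)
```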

Table of Contents

  1. Introduction
  2. Fundamentals of Video Coding
  3. H.265/HEVC and Extensions
  4. Future Video Coding Part I: JVET Activities
  5. Future Video Coding Part II: 360° Video Coding
  6. Future Video Coding Part III: Video Coding Optimization
  7. Summary and Discussions

Speaker

Zhenzhong Chen received the B.Eng. degree from Huazhong University of Science and Technology, Wuhan, China, and the Ph.D. degree from The Chinese University of Hong Kong, Shatin, Hong Kong, both in electrical engineering. He is currently a Professor at Wuhan University (WHU). Before joining WHU, he worked at MediaTek USA Inc., San Jose, CA, USA. His current research interests include video coding and standardization, visual perception, image processing, and multimedia communications. He has been an active contributor to the ISO/MPEG and ITU-T video coding standards and to the Audio and Video Coding Standard Workgroup of China, including contributions to High Efficiency Video Coding (HEVC), the HEVC range extension, HEVC scalable video coding, AVS2, AVS 3D, and IEEE 1857. He has over 20 U.S. and Chinese patents granted or pending in image and video coding and processing. He has been a VQEG board member and Immersive Media Working Group Co-Chair, a Selection Committee Member of the ITU Young Innovators Challenges, a member of the IEEE Multimedia Systems and Applications Technical Committee, and Co-Chair of the Networking Technologies for Multimedia Communication Interest Group of the IEEE Multimedia Communication Technical Committee. He is an editor of the Journal of Visual Communication and Image Representation and an editor of the IEEE IoT Newsletter. He was the Special Session Chair of the IEEE World Forum on Internet of Things 2014 and Publication Chair of the IEEE International Conference on Multimedia and Expo 2014, and has served as an Area Chair for ICME 2015/2016 and IEEE BigMM 2015 and as a technical program committee member of IEEE ICC, GLOBECOM, CCNC, ICME, etc. He was a recipient of the CUHK Young Scholar Dissertation Award, the CUHK Faculty of Engineering Outstanding Ph.D. Thesis Award, a Microsoft Fellowship, an ERCIM Alain Bensoussan Fellowship, and the First Class Prize of the 2015 IEEE BigMM Challenge, and was a finalist for the IEEE SMC 2015 Franklin V. Taylor Memorial Best Paper Award. He is an IEEE senior member.

Jizheng Xu received the B.S. and M.S. degrees in computer science from the University of Science and Technology of China (USTC), and the Ph.D. degree in electrical engineering from Shanghai Jiao Tong University, China. He joined Microsoft Research Asia (MSRA) in 2003, where he is currently a Lead Researcher. His research interests include image and video representation, media compression, and communication. He has been an active contributor to the ISO/MPEG and ITU-T video coding standards, with over 40 technical proposals adopted by the H.264/AVC, H.264/AVC scalable extension, High Efficiency Video Coding, HEVC range extension, and HEVC screen content coding standards. He chaired and co-chaired the ad-hoc group on exploration of wavelet video coding in MPEG, as well as various technical ad-hoc groups in JCT-VC, e.g., on screen content coding, parsing robustness, and lossless coding. He has authored or co-authored over 100 refereed conference and journal papers. He co-organized and co-chaired special sessions on scalable video coding, directional transforms, and high-quality video coding at various conferences. He has over 30 U.S. patents granted or pending in image and video coding. He also served as a Special Session Co-Chair of the IEEE International Conference on Multimedia and Expo 2014 and as a Guest Editor for a special issue on Screen Content Video Coding and Applications for the IEEE Journal on Emerging and Selected Topics in Circuits and Systems. He has been an IEEE senior member since 2010.

Visual Content Perception and Understanding

Abstract
Over the past decades, there has been a growing interest in visual content perception and understanding, with broad multimedia applications. As an important component of visual perception, visual quality assessment has attracted extensive attention from both academia and industry. It can be used not only to monitor image quality degradation but also to optimize various image processing algorithms and systems. Visual recognition aims to analyze visual content from images and videos, and plays an important role in many visual content understanding systems such as face recognition, person re-identification, and visual search. In this tutorial, we present the past achievements, current status, and open problems of visual quality assessment and visual recognition.

Table of Contents
This tutorial will mainly introduce the development of visual quality assessment and visual recognition. In the first section, we briefly introduce the achievements of visual quality assessment during the past decade and review the key advantages and disadvantages of existing visual quality assessment metrics. We then introduce some new visual quality assessment methods proposed by our team, of two main types: general-purpose and application-specific approaches. Some open problems in visual quality assessment are also discussed as potential directions for future research. In the second section, we introduce the basic concept of distance metric learning and show the key advantages and disadvantages of existing distance metric learning methods in different visual understanding tasks. We then introduce some of our newly proposed distance metric learning methods, such as deep metric learning, cost-sensitive metric learning, and Hamming distance metric learning. Lastly, we discuss some open problems in distance metric learning and show how more advanced metric learning algorithms for visual understanding could be developed in the future.
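
To fix ideas, many distance metric learning methods, including some of those above, parameterize a Mahalanobis-type distance d(x, y) = ||Lx − Ly||₂ with a matrix L learned from labeled pairs so that same-class samples end up close and different-class samples far apart. The sketch below shows only the distance computation, with a random (untrained) L as a placeholder:

```python
import numpy as np

def learned_distance(x, y, L):
    """Mahalanobis-type distance d(x, y) = ||L x - L y||_2, i.e. M = L^T L.

    Metric learning methods optimize L (or M directly); here L is a
    stand-in to illustrate the computation, not a trained metric.
    """
    return float(np.linalg.norm(L @ (x - y)))

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 10))     # untrained projection (placeholder)
x, y = rng.standard_normal(10), rng.standard_normal(10)
print(learned_distance(x, y, L))
```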

Presenters
Yuming Fang, Ph.D.
School of Information Technology
Jiangxi University of Finance and Economics, Nanchang, China
E-mail: fa0001ng@e.ntu.edu.sg

Biography: Yuming Fang is a Professor in the School of Information Technology, Jiangxi University of Finance and Economics, Nanchang, China. Previously, he was a (visiting) researcher at the IRCCyN lab, PolyTech’ Nantes, University of Nantes, France, at National Tsing Hua University, Taiwan, and at the University of Waterloo, Canada. He received his Ph.D. degree from Nanyang Technological University, Singapore, his M.S. degree from Beijing University of Technology, China, and his B.E. degree from Sichuan University, China. His research interests include visual quality assessment, visual attention, and computer vision. He has published over 90 scientific papers in these areas, including over 30 in IEEE Transactions and journals. He is an IEEE MMTC member and serves as an associate editor of IEEE Access and Signal Processing: Image Communication. He is/was a TPC Co-Chair for ISITC 2016, a Program Chair for the Third Workshop on Emerging Multimedia Systems and Applications, and an Area Chair for VCIP 2016, ICME 2016, ICME 2017, etc. He is a member of the IEEE.

Jiwen Lu, Ph.D.
Department of Automation
Tsinghua University, Beijing, China
E-mail: lujiwen@tsinghua.edu.cn

Biography: Jiwen Lu is an Associate Professor in the Department of Automation, Tsinghua University, China. From March 2011 to November 2015, he was a Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. His research interests include computer vision, pattern recognition, and machine learning. He has authored or co-authored over 150 scientific papers in these areas, of which 40 appear in IEEE Transactions (including 4 in PAMI and 10 in TIP) and 20 in top-tier computer vision conferences (ICCV/CVPR/ECCV). He is an elected member of the Information Forensics and Security Technical Committee of the IEEE Signal Processing Society. He serves as an Associate Editor of Pattern Recognition Letters, Neurocomputing, and IEEE Access, and as a Guest Editor for five journals, including Pattern Recognition, Computer Vision and Image Understanding, and Image and Vision Computing. He is/was a Program Co-Chair for ICGIP’17; an Area Chair for ICIP’17, BTAS’16, ICB’16, WACV’16, VCIP’16, ICME’15, and ICB’15; a Workshop Co-Chair for WACV’17 and ACCV’16; and a Special Session Co-Chair for VCIP’15. He is a senior member of the IEEE.

Image-Based Three-Dimensional Data Acquisition

Abstract
With the advances in image processing technology, many image-based three-dimensional data acquisition techniques have been developed and used in various applications, including medical diagnosis, machine part inspection, and movie and video game production, among many others. The objective of this tutorial is to give an overview of the basic principles of these techniques and discuss their latest developments. The first part of this tutorial will introduce the basic principles of fringe projection profilometry (FPP), one of the popular structured light illumination techniques for measuring the 3D shape of objects in a non-contact manner. Some of the latest developments in robust FPP using sparse representation techniques will also be discussed. The second part of this tutorial will cover motion capture (Mocap) data processing. Techniques for Mocap data compression will be introduced, and applications of depth cameras will also be discussed.

Table of Contents
The first part of this tutorial is on fringe projection profilometry (FPP) techniques for 3D measurement of object shape. It starts with an introduction to the background and applications of FPP techniques, including some current applications of FPP in 3D microscopy, 3D endoscopes, and other 3D medical devices. Then the principles of two major FPP techniques will be elaborated: Fourier transform profilometry (FTP) and phase-shifting profilometry (PSP). Their limitations in practical working environments, which generally lower the robustness of FPP in adverse conditions, will also be illustrated. Two state-of-the-art sparse representation techniques for enhancing the robustness of FPP will then be introduced, with simulation and experimental results shown to illustrate their effectiveness. The second part of this tutorial is on motion capture data processing. It starts with an introduction to Mocap signal processing techniques. Some of the latest topics, such as Mocap data recovery, low-rank representation, and trajectory-based representation of Mocap data, will be discussed. In addition, the tutorial will cover Mocap sequence compression using low-rank matrix approximation and low-latency Mocap data compression. At the end, two applications of Mocap data processing will be briefly discussed, with video demonstrations to illustrate their performance.
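
As a concrete example of the PSP principle mentioned above, the sketch below recovers the wrapped phase from four fringe images shifted by 90° each, using φ = atan2(I₄ − I₂, I₁ − I₃). The assumed fringe model is the standard four-step pattern; image acquisition, phase unwrapping, and depth conversion are omitted:

```python
import numpy as np

def wrapped_phase_4step(i1, i2, i3, i4):
    """Wrapped phase from four fringe images with 90-degree shifts.

    Assumes I_n = A + B*cos(phi + n*pi/2), n = 0..3, so that
    I4 - I2 = 2B*sin(phi) and I1 - I3 = 2B*cos(phi).
    Returns phase in (-pi, pi]; unwrapping is a separate step.
    """
    return np.arctan2(i4 - i2, i1 - i3)

# Synthetic check: recover a known phase ramp from generated fringes
phi = np.linspace(-3, 3, 256)
imgs = [100 + 50 * np.cos(phi + n * np.pi / 2) for n in range(4)]
assert np.allclose(wrapped_phase_4step(*imgs), phi, atol=1e-6)
```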

Speaker
Daniel P. K. Lun received his B.Sc. (Hons.) degree from the University of Essex, U.K., in 1988 and his PhD degree from The Hong Kong Polytechnic University in 1991. In 1991, he joined The Hong Kong Polytechnic University, where he is now an Associate Professor and Interim Head of the Department of Electronic and Information Engineering. His research interests include wavelet theory, signal and image enhancement, computational imaging, and 3D model reconstruction. He was the Chairman of the IEEE Hong Kong Chapter of Signal Processing in 1999-2000. He was the General Chair of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP 2004) and General Co-Chair of the 19th International Conference on Digital Signal Processing (DSP 2014). He was also the Technical Co-Chair of the 2015 APSIPA Annual Summit and Conference (APSIPA ASC 2015) and of the 2015 IEEE 20th International Conference on Digital Signal Processing (DSP 2015), and an Area Chair of ICME 2016 and 2017. He was an executive committee member of a number of other international conferences, including ICASSP 2003, ICIP 2010, and ICME 2017. He received a Certificate of Merit from the IEEE Signal Processing Society for dedication and leadership in organizing ICIP 2010. He and his research students have received three best paper awards at international conferences. He was the Editor of HKIE Transactions, published by the Hong Kong Institution of Engineers (HKIE), in the area of Electrical Engineering, and the leading guest editor of a special issue of the EURASIP Journal on Advances in Signal Processing. He is currently an Associate Editor of IEEE Signal Processing Letters. He is a Chartered Engineer, a fellow of the IET, a corporate member of the HKIE, and a senior member of the IEEE. Further information on his publications and academic and professional activities can be found at www.eie.polyu.edu.hk/~enpklun.

Lap-Pui Chau received his bachelor’s degree from Oxford Brookes University in 1992 and his Ph.D. degree from The Hong Kong Polytechnic University in 1997. His research interests include fast visual signal processing algorithms, light-field imaging, VLSI for signal processing, and human motion analysis. He was a General Chair of the IEEE International Conference on Digital Signal Processing (DSP 2015) and of the International Conference on Information, Communications and Signal Processing (ICICS 2015). He was a Program Chair of the International Conference on Multimedia and Expo (ICME 2016), Visual Communications and Image Processing (VCIP 2013), and the International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS 2010). He was the Chair of the Technical Committee on Circuits and Systems for Communications (TC-CASC) of the IEEE Circuits and Systems Society from 2010 to 2012. He served as an associate editor of IEEE Transactions on Multimedia, IEEE Signal Processing Letters, IEEE Transactions on Circuits and Systems for Video Technology, and the IEEE Circuits and Systems Society Newsletter, and is currently serving as an associate editor of IEEE Transactions on Circuits and Systems II, IEEE Transactions on Broadcasting, and The Visual Computer (Springer). In addition, he was an IEEE Distinguished Lecturer for 2009-2016 and a steering committee member of IEEE Transactions on Mobile Computing from 2011 to 2013. He is an IEEE Fellow. Further information on his publications and research achievements can be found at http://www.ntu.edu.sg/home/elpchau.

Deep Learning for Big Visual Data Analytics

Abstract
Deep learning has received a lot of attention in recent years due to its superior performance on many computer vision benchmark datasets. In this tutorial, I will cover the following four topics.

A. Big Visual Data Analytics

I will first discuss several basic concepts related to the big visual data analytics problem, such as discriminative versus generative models and heavily versus weakly supervised learning, to set up the modern data-driven machine learning methodology. I will also introduce several well-known datasets, such as ImageNet, Places, and Microsoft COCO.

B. CNN Architectural Evolution

I will begin with the McCulloch-Pitts (M-P) neuron model and network of 1943, then the artificial neural networks (ANNs) of the 1980s and 1990s, and finally the modern convolutional neural networks (CNNs) developed since the late 1990s. The differences between these three generations of networks will be clearly explained. Furthermore, there are three main types of modern CNNs: pyramidal CNNs, fully convolutional networks (FCNs), and residual networks. Their specific functions will be explained.
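
For reference, the M-P neuron is simply a thresholded weighted sum with a hard binary output, in contrast to the differentiable activations of later ANNs and CNNs. A minimal sketch, with an illustrative choice of weights and threshold realizing an AND gate:

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts (1943) neuron: fire iff the weighted sum of
    binary inputs reaches the threshold. No learning rule is involved."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# A two-input AND gate (weights and threshold chosen by hand)
print([mp_neuron((a, b), (1, 1), 2) for a in (0, 1) for b in (0, 1)])
# -> [0, 0, 0, 1]
```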

C. Theoretical Foundation

I will provide a theoretical foundation for the working principle of the CNN from a signal processing viewpoint. To begin with, the RECOS transform is introduced as a basic building block for CNNs. The term “RECOS” is an acronym for “REctified-COrrelations on a Sphere”. It consists of two main concepts: data clustering on a sphere and rectification. A CNN is then interpreted as a network that implements a guided multi-layer RECOS transform. Along this line, we first compare the traditional single-layer and modern multi-layer signal analysis approaches. Then, we discuss how guidance is provided by data labels through backpropagation during training, with an attempt to offer a smooth transition from weakly to heavily supervised learning.
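
The following sketch, with random placeholder anchor vectors standing in for learned filters, illustrates the two RECOS ingredients just described: projection onto the unit sphere followed by rectified correlation against a set of anchor vectors:

```python
import numpy as np

def recos(x, anchors, eps=1e-12):
    """One RECOS stage: rectified correlations on the unit sphere.

    x:       input vector
    anchors: (K, d) matrix of anchor (filter) vectors
    Returns the K non-negative correlation responses.
    """
    x = x / (np.linalg.norm(x) + eps)                        # map to sphere
    a = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + eps)
    return np.maximum(a @ x, 0.0)                            # rectification

rng = np.random.default_rng(1)
print(recos(rng.standard_normal(8), rng.standard_normal((4, 8))))
```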

D. Deep Features Visualization and Analysis

A deep network learns features (called deep features) automatically from training data. I will use two quantitative metrics to shed light on learned deep features: the Gaussian confusion measure (GCM) and the cluster purity measure (CPM). The GCM identifies the discriminative ability of an individual feature, while the CPM analyzes the group discriminative ability of a set of deep features. Experiments confirm that these two metrics accurately reflect the discriminative ability of trained deep features.
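
As an illustration of the second metric, the sketch below computes a textbook cluster purity score, where each cluster is credited with its dominant class. The CPM used in the tutorial may differ in detail from this standard definition:

```python
from collections import Counter

def cluster_purity(cluster_ids, class_labels):
    """Purity = (1/N) * sum over clusters of the dominant-class count.

    cluster_ids:  cluster assignment per sample (e.g. from k-means run
                  on a set of deep feature responses)
    class_labels: ground-truth class per sample
    """
    per_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    dominant = sum(cnt.most_common(1)[0][1] for cnt in per_cluster.values())
    return dominant / len(class_labels)

print(cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))  # 4/6
```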

Speaker

Dr. C.-C. Jay Kuo received his Ph.D. degree from the Massachusetts Institute of Technology in 1987. He is now with the University of Southern California (USC) as Director of the Media Communications Laboratory and Dean’s Professor in Electrical Engineering-Systems. His research interests are in the areas of digital media processing, compression, and communication and networking technologies. Dr. Kuo was the Editor-in-Chief of the IEEE Transactions on Information Forensics and Security from 2012 to 2014. He was the Editor-in-Chief of the Journal of Visual Communication and Image Representation from 1997 to 2011, and served as Editor for 10 other international journals.
Dr. Kuo received the 1992 National Science Foundation Young Investigator (NYI) Award, the 1993 National Science Foundation Presidential Faculty Fellow (PFF) Award, the 2010 Electronic Imaging Scientist of the Year Award, the 2010-11 Fulbright-Nokia Distinguished Chair in Information and Communications Technologies, the 2011 Pan Wen-Yuan Outstanding Research Award, the 2014 USC Northrop Grumman Excellence in Teaching Award, the 2016 USC Associates Award for Excellence in Teaching, the 2016 IEEE Computer Society Taylor L. Booth Education Award, the 2016 IEEE Circuits and Systems Society John Choma Education Award, the 2016 IS&T Raymond C. Bowman Award, and the 2017 IEEE Leon K. Kirchmayer Graduate Teaching Award. Dr. Kuo is a Fellow of AAAS, IEEE and SPIE. He has guided 140 students to their Ph.D. degrees and supervised 25 postdoctoral research fellows. Dr. Kuo is a co-author of about 250 journal papers, 900 conference papers, 14 books and 30 patents.