Looking inside an MPEG meeting

Introduction

In There is more to say about MPEG standards I presented the entire spectrum of MPEG standards. No one should deny that it is an impressive set of disparate technologies integrated to cover  fields connected by the common thread of Data Compression: Coding of Video, Audio, 3D Graphics, Fonts, Digital Items, Sensors and Actuators Data, Genome, and Neural Networks; Media Description and Composition; Systems support; Intellectual Property Management and Protection (IPMP); Transport; Application Formats; API; and Media Systems.

How on earth can all these technologies be specified and integrated in MPEG standards to respond to industry needs?

This article will try and answer this question. It will do so by starting, as many novels do, from the end (of an MPEG meeting).

Let’s start from the end (of an MPEG meeting)

When an MPEG meeting closes, the plenary approves the results of the week, marking the end of formal collaborative work within the meeting. Back in 1990 MPEG developed a mechanism – called “ad hoc group” (AhG) – that allows a form of collaboration to continue. This mechanism lets MPEG experts keep working together, albeit with limitations:

  1. In the scope, i.e. an AhG may only work on the areas identified by its mandates (in Latin ad hoc means “for a specific purpose”). Of course experts are free to work individually on anything and in any way that pleases them;
  2. In the purpose, i.e. an AhG may only prepare recommendations – in the scope of its mandates – to be submitted to MPEG. This is done at the beginning of the following meeting, after which the AhG is disbanded;
  3. In the method of work, i.e. an AhG operates under the leadership of one or more Chairs. Clearly, though, the success of an AhG depends very much on the attitude and activity of its members.

On average some 25 AhGs are established at each meeting. There is no one-to-one correspondence between MPEG activities and AhGs. Actually, AhGs are great opportunities to explore new and possibly cross-subgroup ideas.

Examples of AhG titles are

  1. Scene Description for MPEG-I
  2. System technologies for Point Cloud Coding (PCC)
  3. Network Based Media Processing (NBMP)
  4. Compression of Neural Networks (NNR).

What happens between MPEG meetings

An AhG uses different means to carry out collaborative work: email reflectors, teleconferences and physical meetings. The last may only be held if they were scheduled in the AhG establishment form; unscheduled physical meetings may only be held if there is unanimous agreement of those who subscribed to the AhG.

Most AhGs hold scheduled meetings on the weekend that precedes the next MPEG meeting. These are very useful to coordinate the results of the work done and to prepare the report that all AhGs must make to the MPEG plenary on the following Monday.

AhG meetings, including those in the weekend preceding the MPEG meeting, are not formally part of an MPEG meeting.

An MPEG meeting at a glance

Chairs meeting

MPEG chairs meet three times during an MPEG week:

  1. On Sunday evening to review the progress of AhG work, coordinate activities impacting more than one Subgroup and plan activities to be carried out during the week including the need for joint meetings;
  2. On Tuesday evening to assess the result of the first two days of work, review the work plan and time lines based on the expected outcomes and identify the need of new joint meetings;
  3. On Thursday evening to wrap up the expected results and review the preliminary results of the week.

Plenary meetings

During an MPEG week MPEG holds 3 plenaries:

  1. On Monday morning: to make everybody aware of the results of work carried out since the last meeting and to plan work of the week. AHG reports are a main part of it as they are presented and, when necessary, discussed;
  2. On Wednesday morning to make everybody aware of the work done in all subgroups in the first two days and to plan work for the next two days;
  3. On Friday afternoon to approve the results of the work of Subgroups, including liaison letters, to establish new AhGs etc.

Subgroup, Breakout Group and Joint meetings

Subgroups start their meetings on Monday afternoon. They review their own activities and kick off work in their areas. Each subgroup assigns activities to breakout groups (BoG) who meet with their own schedules to achieve the goals assigned. Each Subgroup may hold other brief meetings to keep everybody in the Subgroup in sync with the general progress of the work.

For instance, the activities of the Systems Subgroup are currently: File format, DASH, OMAF, OMAF and DASH, OMAF and MIAF, MPEG Media Transport, Network Based Media Processing and PCC Systems.

The MPEG structure is designed to facilitate interactions between Subgroups, and between BoGs from different Subgroups, on matters that sit at the interface of MPEG subsystems. For example, the table below lists the joint meetings that the Systems Subgroup held with other Subgroups at the January 2019 meeting.

Table 1 – Joint meeting of Systems Subgroup with other Subgroups

Systems meeting with      Topics
Reqs, Video, VCEG         SEI messages in VVC
Audio, 3DG                Scene Description
3DG                       Systems for Point Cloud Compression
3DG                       API for multiple decoders
Audio                     Uncompressed Audio
Reqs, JVET, VCEG          Immersive Decoding Interface

NB: VCEG is the Video Coding Experts Group of ITU-T Study Group 16. It is not an MPEG Subgroup.

On Friday morning all Subgroups approve their own results. These are automatically integrated in the general document to be approved by the MPEG Plenary on Friday afternoon.

Advisors meeting

On Monday evening, an informal group of experts from different countries examines issues of general (non-technical) interest. In particular it calls for meeting hosts, reviews proposals of meeting hosts, makes recommendations of meeting hosts to the plenary etc.

A bird’s eye view of an MPEG meeting

Figure 1 depicts the workflow described in the paragraphs above, starting from the end of the (N-1)-th meeting to the end of the N-th meeting.

Figure 1 – A snapshot of MPEG works from the end of a meeting to the end of the next meeting

What is “done” at an MPEG meeting?

Around 500 of the best experts worldwide attend an MPEG meeting – an incredible amount of brain power mobilised in one week. In the following I will try and explain how this brain power is directed.

An example – the video area

Let’s take as example the work done in the Video Coding area at the March 2019 meeting.

The table below has 3 columns:

  1. The standards on which work is done (Video has worked on MPEG-H, MPEG-I, MPEG-CICP, MPEG-5 and Explorations)
  2. The names of the activities and
  3. The types of documents resulting from the activities (see the following legend for an explanation of the acronyms).

Table 2 – Documents produced in the video coding area

Std    Activity                                           Document type
H      High Efficiency Video Coding (HEVC)                TM, CE, CTC
I      Versatile Video Coding (VVC)                       WD, TM, CE, CTC
I      3 Degrees of Freedom + (3DoF+) coding              CfP, WD, TM, CE, CTC
CICP   Usage of video signal type code points (Ed. 1)     TR
CICP   Usage of video signal type code points (Ed. 2)     WD
5      Essential Video Coding                             WD, TM, CE, CTC
5      Low Complexity Enhancement Video Coding            CfP, WD, TM, CE, CTC
Expl   6 Degrees of Freedom (6DoF) coding                 EE, Tools
Expl   Coding of dense representation of light fields     EE, CTC

Legend

TM: Test Model, software implementing the standard (encoder & decoder)

WD: Working Draft

CE: Core Experiment, i.e. definition of an experiment that should improve performance

CTC: Common Test Conditions, to be used by all CE participants

CfP: Call for Proposals (this time no new CfP produced, but reports and analyses of submissions in response to CfPs)

TR: Technical Report (ISO document)

EE: Exploration Experiment, an experiment to explore an issue because it is not mature enough to be a CE

Tools: other supporting material, e.g. software developed for common use in CEs/EEs

What is produced by an MPEG meeting

Figure 2 gives the number of activities for each type of activity defined in the legend (and others that were not part of the work in the video area). For instance, out of a total of 97 activities:

  1. 29 relate to processing of standards through the canonical stages of Committee Draft (CD), Draft International Standard (DIS) and Final Draft International Standard (FDIS), and the equivalent stages for Amendments, Technical Reports and Corrigenda. In other words, at every meeting MPEG is working on ~10 “deliverables” (i.e. standards, amendments, technical reports or corrigenda) in the approval stages;
  2. 22 relate to working drafts, i.e. “new” activities that have not entered the approval stages;
  3. 8 relate to Technologies under Consideration, i.e. new technologies that are being considered to enhance existing standards;
  4. 8 relate to requirements, typically for new standards;
  5. 6 relate to Core Experiments;
  6. Etc.

 

Figure 2 – Activities at an MPEG meeting

Figure 2 does not provide a quantitative measure of “how many” documents were produced for each activity or “how big” they were.  As an example, Point Cloud Compression has 20 Core Experiments and 8 Exploration Experiments under way, while MPEG-5 EVC has only one large CE.

An average measure of activity at the March 2019 meeting is obtained by dividing the number of output documents (212) by the number of activities (97), i.e. ~2.2 documents per activity.

Conclusions

MPEG holds quarterly meetings with an attendance of ~500 experts. If we assume that the average salary of an MPEG expert is 500 $/working day and that every expert stays 6 days (to account for attendance at AhG meetings), the industry investment in attending MPEG meetings is 1.5 M$/meeting or 6 M$/year. Of course, the total investment is more than that and probably in excess of 1B$ a year.
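The back-of-the-envelope arithmetic above can be reproduced in a few lines of Python. The figures are the assumptions stated in the text (average salary, days of attendance, quarterly meetings), not measured data.

```python
# Rough estimate of the industry investment in attending MPEG meetings.
# All figures are the assumptions quoted in the text above.
EXPERTS_PER_MEETING = 500      # average attendance
COST_PER_EXPERT_DAY = 500      # USD per working day (assumed average salary)
DAYS_PER_MEETING = 6           # meeting week plus AhG weekend attendance
MEETINGS_PER_YEAR = 4          # quarterly meetings

cost_per_meeting = EXPERTS_PER_MEETING * COST_PER_EXPERT_DAY * DAYS_PER_MEETING
cost_per_year = cost_per_meeting * MEETINGS_PER_YEAR

print(f"Per meeting: {cost_per_meeting / 1e6:.1f} M$")   # 1.5 M$
print(f"Per year:    {cost_per_year / 1e6:.1f} M$")      # 6.0 M$
```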

With the meeting organisation described above MPEG tries to get the most out of the industry investment in MPEG standards.


MPEG and ISO

Introduction

The article MPEG: what it did, is doing, will do recounts my statistically not insignificant experience of asking taxi drivers in different cities of the world whether they know MPEG. I do not have a similar amount of data for ISO, but I am pretty sure that if I asked a taxi driver whether they know ISO, the yes rate would be considerably lower than for MPEG.

This is not a merit of MPEG or a demerit of ISO as organisations. MPEG – Moving Picture Experts Group – is lucky to deal with things that let people make content that other people can see and hear in ever new ways. ISO – International Organisation for Standardisation – has the mission to develop international standards for anything that is not telecommunication – the purview of the International Telecommunication Union (ITU) – or electrotechnical – the purview of the International Electrotechnical Commission (IEC).

The ISO organisation

The above may seem rather abstract, so let’s see what the difference means in practice. ISO is a huge organisation structured in Technical Committees (TC). Actually, the structure is more complex than that (see Figure 1), but for the purpose of what I want to say, this is enough.

The first 3 – still active – TCs in ISO are: TC 1 Screw threads, TC 2 Fasteners and TC 4 Rolling bearings. The standards produced by these TCs are industrially very important, but the topics hardly make people’s hearts beat faster. The last 3 TCs in order of establishment are TC 322 Sustainable finance, TC 323 Circular economy and TC 324 Sharing economy. The standards produced by these TCs are important for the financial industries, but probably little known even in financial circles. Between these two extremes we have a large number of TCs, e.g., TC 35 Paints and varnishes, TC 186 Cutlery and table and decorative metal hollow-ware, TC 249 Traditional Chinese medicine, TC 282 Water reuse, TC 297 Waste collection and transportation management, etc.

ISO TCs work on areas of human endeavour that are extremely important to industrial and social life. Many of these activities, however, do not say much to the man in the street.

Where is MPEG in this picture? To answer this question I need to dig deeper into the ISO organisation. Most TCs do not have a monolithic structure: they are organised in Working Groups (WG). TCs retain key functions such as strategy and management, and WGs are tasked with developing standards. In quite a few cases the area of responsibility is so broad that a horizontal organisation would not be functional. In this case a TC may decide to establish Subcommittees (SC), which are like mini-TCs with WGs developing standards under them.

Figure 1 – ISO governance structure

In 1987 ISO/TC 97 Data Processing merged with IEC/TC 83 Information technology equipment. The resulting (joint) technical committee was and is called ISO/IEC JTC 1 Information Technology. One JTC 1 SC, SC 2 Character Sets and Information Coding, included WG 8 Coding of Audio and Picture Information. WG 8 established the Moving Picture Experts Group (MPEG) in January 1988. In 1991, when SC 2/WG 8 seceded from SC 2 and became SC 29, MPEG became WG 11 Coding of audio, picture, multimedia and hypermedia information (but everybody calls it MPEG).

MPEG changed the world of media

Those who have survived the description of the ISO organigram will now have the opportunity to understand how this group of experts, in the depths of the ISO (and IEC) organisation, changed the world of media and impacted the lives of billions of people – probably all of those on the face of the Earth, if we exclude some hermits in the few surviving tropical forests, the many deserts or the frozen lands.

The main reason for the success of MPEG is that for 30 years it had carte blanche to implement its ideas. Some of them were clear at the outset, others took shape through a process of learning on the job.

Let’s revisit MPEG’s ideas of standardisation to understand what it did and why.

Idea #1 – Single standards for all countries and industries

The first idea relates to the scope of MPEG standards. In the analogue world, the absence or scarce availability of broadband communication, deliberate policies, and the natural separation between industries that traditionally had little in common favoured the definition of country-based or industry-based standards. The first steps toward digital video undertaken by countries and industries trod similar paths: different countries and industries tried their own way independently.

MPEG jumped onto the scene at a time when the different trials had not had the time to solidify, and the epochal analogue-to-digital transition gave MPEG a unique opportunity to effect its disruptive action.

MPEG knew that it was technically possible to develop generic standards that could be used in all countries of the world and in all industries that needed compressed digital media. MPEG saw that all actors affected – manufacturers, service providers and end users – would gain if such a bold action was taken. When MPEG began to tread its adventurous path, MPEG did not know  whether it was procedurally possible to achieve that goal. But it gambled and gave it a try. It used the Requirements subgroup to develop generic requirements, acted on the major countries and trade/standards associations of the main industries and magically got their agreement.

The network of liaisons and, sometimes, joint activities is the asset that allowed MPEG to implement idea #1 and helped achieve many of the subsequent goals.

Idea #2 – Standards for the market, not the other way around

Standards are ethereal entities, but their impact is very concrete. This was true and well understood in the world of analogue media. At that time a company that had developed a successful product would try to get a “standard” stamp on it, share the technology with its competitors and enjoy the economic benefits of its “standard” technology.

With its second idea MPEG reshuffled the existing order of steps. Instead of waiting for the market to decide which technology would win – an outcome that very often had little to do with the value of the technology – MPEG offered its standard development process, where the collaboratively defined “best” is developed and assessed by MPEG experts who decide which individual technologies win. Then the “standard” technology package developed by MPEG is taken over by the market.

MPEG standards are consistently the best standards at a given time. Those who have technologies selected to be part of MPEG standards reap the benefits and most likely will continue investing in new technologies for future standards.

Idea #3 – Standards anticipate the future

The third idea is a consequence of the first two. MPEG-1 was driven by the expected possibilities of the audio and video compression technologies of the time. It was a bet on silicon making it possible to execute the complex operations implied by the standard so that industry could build products of which there was no evidence but only educated guesses: interactive video on CD and digital audio broadcasting. Ironically, neither really took off, but other products that relied on the MPEG-1 technologies – Video CD and MP3 – were (the former) and still are (the latter) extremely successful.

MPEG standards anticipate market needs. They are regularly bets that a certain standard technology will be adopted. In More standards – more successes – more failures you can see how some MPEG standards are extremely successful and others less so.

Idea #4 – Industry-friendly standards

The fourth idea was simple and disruptive. Since its first instances in the 1920s, industry and governments have created tens of television formats, mostly around the basic NTSC, PAL and SECAM families. Even in the late 1960s, when the Picturephone was developed, AT&T invented a new 267-line format, with no obvious connection with any of the existing video formats.

MPEG never wanted to define its own format. With its fourth idea, propped up by the nature of digital technologies, it just decided that it would support any format. Here is how it did it (a toy sketch follows the list):

  1. One standard with no options (this should be obvious, because it is what a standard should be about)
  2. Standards apply only to decoders; encoders are implicitly defined and have ample margins of implementation freedom
  3. Profiles (hierarchical, if possible) to accommodate special industry needs within the same standard
  4. Decoders are defined by their ability to process data, quantised in levels (based on bitrate, resolution etc.)
  5. How different formats are handled is outside of MPEG standards.
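To make the profile/level mechanism concrete, here is a minimal sketch of a conformance check, assuming an invented profile hierarchy and invented level limits; none of the names or numbers below come from an actual MPEG specification.

```python
# Toy illustration of profile/level signalling. The profile hierarchy and
# level limits below are invented for the example, not taken from any
# MPEG standard.
PROFILE_HIERARCHY = {            # profile -> profiles it can also decode
    "Baseline": {"Baseline"},
    "Main":     {"Baseline", "Main"},
    "High":     {"Baseline", "Main", "High"},
}
LEVEL_LIMITS = {                 # level -> (max luma samples/frame, max kbit/s)
    1: (414_720, 4_000),
    2: (2_073_600, 20_000),
    3: (8_847_360, 60_000),
}

def decoder_can_handle(dec_profile, dec_level, bs_profile, luma_samples, kbit_s):
    """True if a decoder of the given profile/level can decode a bitstream
    with the given profile, picture size and bitrate."""
    max_samples, max_rate = LEVEL_LIMITS[dec_level]
    return (bs_profile in PROFILE_HIERARCHY[dec_profile]
            and luma_samples <= max_samples
            and kbit_s <= max_rate)

# A "Main" decoder at level 2 can handle a Baseline HD stream...
print(decoder_can_handle("Main", 2, "Baseline", 1920 * 1080, 8_000))   # True
# ...but not a High-profile stream.
print(decoder_can_handle("Main", 2, "High", 1920 * 1080, 8_000))       # False
```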

Idea #5 – Audio and video come together

The fifth idea was kind of obvious but no less disruptive. Because of the way audio and video industries had developed – audio for a century and video for half a century – people working on the corresponding technologies tended to operate in “watertight compartments”, be they in academia, research or companies. That attitude had some justification in the analogue world because the relevant technologies were indeed different and there was not so much added value in keeping the technologies together, considering the big effort needed to keep the experts together.

However, the digital world, with its commonality of technologies, no longer justified keeping the two domains separate. That is why MPEG, just 6 months after its first meeting, kicked off the Audio subgroup, after successfully assembling the best experts in a few months.

This injection of new technology with the experts who carried it was not effortless. When transformed into digital, audio and video signals are bits and bits and bits, but the sources are different and influence how they are compressed. Audio and video experts shared some compression technologies at a high level – Subband and Discrete Cosine Transform coding – but video is (was) a 2D signal changing in time, often with “objects” in it, while audio is (was) a 1D signal. More importantly, audio experts were driven by other concerns, such as the way the human hearing process handles the data coming out of the frequency analysis carried out by the human cochlea.

The audio work was never “dependent” on the video work. MPEG audio standards can have a stand-alone use (i.e. they do not assume that there is a video associated with them), but there is no MPEG video standard without an MPEG Audio standard. So it was necessary to keep the two together, and it is even more important to do so now that video and audio are both 3D signals changing in time.

Idea #6 – Don’t forget the glue that keeps audio and video together

The sixth idea can be described by the formula

Audio and Video ≠ (Audio + Video)

This may look cryptic but it states the obvious. Having audio and video together does not necessarily mean that audio and video will play together in the right way if they are stored on a disk or transmitted over a channel.

The fact that MPEG established a Digital Storage Media subgroup and a Systems subgroup 18 months after its foundation signals that MPEG has always been keenly aware that a stream composed of MPEG audio and video bitstreams needs to be transported so that it can be played back as intended by its creator. In MPEG-1 it was a bitstream in a controlled environment, in MPEG-2 it was a bitstream in a noisy environment, from MPEG-4 on it was on IP, and in MPEG-DASH it had to deal with the unpredictability of the Internet Protocol in the real world.

Throughout its existence, the issues of multiplexing and transport formats have shaped MPEG standards. Without a Systems subgroup, efficiently compressed audio and video bitstreams would have remained floating in space, without a standard means to plug them into real systems.

Idea #7 – Integrated standards as toolkits

Most MPEG standards are composed of the 3 key elements – audio, video and systems – that make an audio-visual system, and some, such as MPEG-4 and MPEG-I, even include 3D Graphics information. These standards are integrated in the sense that, if you need a complete solution, you can get what you need from the package offered by MPEG.

The world is more complicated than that. Some users want to cherry-pick technologies. In the case of MPEG-I, most likely MPEG will not standardise a Scene Description technology but will just indicate how externally defined technologies can be plugged into the system.

With its seventh idea MPEG is ready to satisfy the needs of all customers. It defines the means to signal how an external technology can be plugged into a set of other native MPEG technologies. With one caveat: the customer has to take care of the integration of the external technology. That, MPEG will not do.

Idea #8 – Technology is always on the move

To describe the eighth idea, I will seek help from the Greek philosopher Heraclitus (or whoever was the person who said it): τὰ πάντα ῥεῖ καὶ οὐδὲν μένει (everything flows and nothing stays). Digital technologies move fast and actually accelerate. By applying ideas #3, #4, #5, #6 and #7, MPEG standards accelerated the orderly transition from analogue to digital media. By applying ideas #1 and #2, MPEG standards prompted technology convergence, with its merging of industry segments and appearance of new players.

The eighth idea reminds MPEG that the technology landscape is constantly changing and that this awareness must inform its standards. Until HEVC – one can even say, including the upcoming Versatile Video Coding (VVC) – video meant coding a 2D rectangular area (in MPEG-4, a flat area of any shape). The birth of immersive visual experiences is not without pain, but they are becoming possible and MPEG must be ready with solutions that take this basic assumption into account. This means that, in the technology scenario that is shaping up, the MPEG role of “anticipatory standards” is ever more important and ever more challenging to fulfil.

Idea #9 – The nature and borders of compression

The ninth idea goes down to the very nature of compression. What is the meaning of compression? Is it “fewer bits is always good” or can it also be “as few meaningful bits as possible is also good”? The former is certainly desirable but, as the nature of information consumption changes and compression digs deeper into the nature of information, compressed representations that offer easier access to the information embedded in the data become more valuable.

What is the scope of application of MPEG compression? When MPEG started the MPEG-1 standards work, the gap that separated the telecom from the CE industries (the first two industries in attendance at that time) was as wide as the gap between the media industry and, say, the genomic industry today. Both are digital now and the dialogue gets easier.

With patience and determination MPEG has succeeded in creating a common language and mindset in the media industries. This is an important foundation of MPEG standards. The same amalgamation will continue between MPEG and other industries.

Now the results

Figure 2 intends to attach some concreteness to the nine ideas illustrated above by showing some of the most successful MPEG standards issued from 31 years of MPEG activity.

 

Figure 2 – Some successful MPEG standards

Conclusion

An entity at the lowest layer of the ISO hierarchy has masterminded the transition of media from the analogue to the digital world. Its standards underpin the evolution of digital media, foster the creation of new industries and offer unrelenting growth to old and new industries worth in excess of 1 trillion USD per year.

Many thanks to the parent body SC 29 for managing the balloting of MPEG standards.


Data compression in MPEG

That video is a high-profile topic for people interested in MPEG is obvious – MP stands for Moving Pictures – and is shown by the most visited article in this blog, Forty years of video coding and counting. That audio is also a high-profile topic should not be a surprise either, given that the official MPEG title is “Coding of Moving Pictures and Audio”, and it is confirmed by the fact that Thirty years of audio coding and counting has received almost the same number of visits as the previous one.

What is less known, but potentially very important, is the fact that MPEG has already developed a few standards for compression of a wide range of other data types. Point Cloud is the data type that is acquiring a higher profile by the day, but there are many more types, as represented in the figure below.

Figure 1 – Data types and relevant MPEG standards

 

Video

The articles Forty years of video coding and counting and More video with more features provide a detailed history of video compression in MPEG from two different perspectives. Here I will briefly list the video-coding related standards produced or being produced by MPEG mentioned in the table.

  • MPEG-1 and MPEG-2 both produced widely used video coding standards.
  • MPEG-4 has been much more prolific.
    • It started with Part 2 Visual
    • It continued with Part 9 Reference Hardware Description, a standard that supports a reference hardware description of the standard expressed in VHDL (VHSIC Hardware Description Language), a hardware description language used in electronic design automation.
    • Part 10 is the still high-riding Advanced Video Coding standard.
    • Parts 29, 31 and 33 are the result of three attempts at developing Option 1 video compression standards (in a simple but imprecise way, standards that do not require payment of royalties).
  • MPEG-5 is currently expected to be a standard with 2 parts:
    • Part 1 Essential Video Coding will have a base layer/profile which is expected to be Option 1 and a second layer/profile with a performance ~25% better than HEVC. Licensing terms are expected to be published by patent holders within 2 years.
    • Part 2 Low Complexity Enhancement Video Coding (LCEVC) will be a two-layer video coding standard. The lower layer is not tied to any specific technology and can be any video codec; the higher layer is used to extend the capability of an existing video codec.
  • MPEG-7 is about Multimedia Content Description. There are different tools to describe visual information:
    • Part 3 Visual is a form of compression as it provides tools to describe Color, Texture, Shape, Motion, Localisation, Face Identity, Image signature and Video signature.
    • Part 13 Compact Descriptors for Visual Search can be used to compute compressed visual descriptors of an image. An application is to get further information about an image captured e.g. with a mobile phone.
    • Part 15 Compact Descriptors for Video Analysis makes it possible to manage and organise large-scale databases of video content, e.g. to find content containing a specific object instance or location.
  • MPEG-C is a collection of video technology standards that do not fit into other standards. Part 4 – Media Tool Library is a collection of video coding tools (called Functional Units) that can be assembled using the technology standardised in MPEG-B Part 4 Codec Configuration Representation.
  • MPEG-H part 2 High Efficiency Video Coding is the latest MPEG video coding standard with an improved compression of 60% compared to AVC.
  • MPEG-I is the new standard, mostly under development, for immersive technologies
    • Part 3 Versatile Video Coding is the ongoing project to develop a video compression standard with an expected 50% more compression than HEVC.
    • MPEG-I part 7 Immersive Media Metadata is the current project to develop a standard for compressed Omnidirectional Video that allows limited translational movements of the head.
    • Exploration in 6 Degrees of Freedom (6DoF) and Lightfield are ongoing.

Audio

The article Thirty years of audio coding and counting provides a detailed history of audio compression in MPEG. Here I will briefly list the audio-coding related standards produced or being produced by MPEG mentioned in the table.

  • MPEG-1 part 3 Audio produced, among others, the foundational digital audio standard better known as MP3.
  • MPEG-2
    • Part 3 Audio extended the stereo user experience of MPEG-1 to Multichannel.
    • Part 7 Advanced Audio Coding is the foundational standard on which MPEG-4 AAC is based.
  • MPEG-4 part 3 Advanced Audio Coding (AAC) currently supports some 10 billion devices and software applications, growing by half a billion units every year.
  • MPEG-D is a collection of different audio technologies:
    • Part 1 MPEG Surround provides an efficient bridge between stereo and multi-channel presentations in low-bitrate applications, as it can transmit 5.1 channel audio within a 48 kbit/s transmission budget.
    • Part 2 Spatial Audio Object Coding (SAOC) allows very efficient coding of a multi-channel signal that is a mix of objects (e.g. individual musical instruments).
    • Part 3 Unified Speech and Audio Coding (USAC) combines the tools for speech coding and audio coding into one algorithm with a performance that is equal or better than AAC at all bit rates. USAC can code multichannel audio signals, and can also optimally encode speech content.
    • Part 4 Dynamic Range Control is a post-processor for any type of MPEG audio coding technology. It can modify the dynamic range of the decoded signal as it is being played.

2D/3D Meshes

Polygon meshes can be used to represent the approximate shape of a 2D image or a 3D object. 3D mesh models are used in various multimedia applications such as computer games, animation, and simulation. MPEG-4 provides various compression technologies:

  • Part 2 Visual provides a standard for 2D and 3D Mesh Compression (3DMC) of generic, but static, 3D objects represented by first-order (i.e., polygonal) approximations of their surfaces. 3DMC has the following characteristics:
    • Compression: Near-lossless to lossy compression of 3D models
    • Incremental rendering: No need to wait for the entire file to download to start rendering
    • Error resilience: 3DMC has a built-in error-resilience capability
    • Progressive transmission: Depending on the viewing distance, a reduced accuracy may be sufficient
  • Part 16 Animation Framework eXtension (AFX) provides a set of compression tools for Shape, Appearance and Animation.

Face/Body Animation

Imagine you have a face model that you want to animate remotely. How do you represent the information that animates the model in a bit-thrifty way? MPEG-4 Part 2 Visual has an answer to this question with its Facial Animation Parameters (FAP). FAPs are defined at two levels.

  • High level
    • Viseme (visual equivalent of phoneme)
    • Expression (joy, anger, fear, disgust, sadness, surprise)
  • Low level: 66 FAPs associated with the displacement or rotation of the facial feature points.

In the figure, feature points affected by FAPs are indicated as black dots. Other feature points are indicated as small circles.

Figure 2 – Facial Animation Parameters

It is possible to animate a default face model in the receiver with a stream of FAPs, or a custom face can be initialised by downloading Face Definition Parameters (FDP) with specific background images, facial textures and head geometry.

MPEG-4 Part 2 uses a similar approach for Body Animation.
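To make the two-level structure concrete, here is a minimal sketch of how a FAP frame could be represented as a data structure. The field names are illustrative; only the split between high-level visemes/expressions and the 66 low-level feature-point parameters reflects the description above.

```python
# Minimal sketch of a Facial Animation Parameters (FAP) frame.
# Field names are illustrative; only the two-level structure (high-level
# viseme/expression plus 66 low-level feature-point FAPs) reflects MPEG-4.
from dataclasses import dataclass, field
from typing import List, Optional

EXPRESSIONS = ("joy", "anger", "fear", "disgust", "sadness", "surprise")

@dataclass
class FAPFrame:
    viseme: Optional[int] = None            # high level: visual phoneme index
    expression: Optional[str] = None        # high level: one of EXPRESSIONS
    expression_intensity: float = 0.0       # 0.0 .. 1.0
    low_level: List[float] = field(default_factory=lambda: [0.0] * 66)
                                            # displacement/rotation of the
                                            # 66 facial feature points

# A decoder would apply a stream of such frames to the (default or
# FDP-initialised) face model, one frame per animation time instant.
smile = FAPFrame(expression="joy", expression_intensity=0.6)
print(smile.expression, len(smile.low_level))
```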

Scene Graphs

So far MPEG has never developed a Scene Description technology. In 1996, when the development of the MPEG-4 standard required it, it took the Virtual Reality Modelling Language (VRML) and extended it to support MPEG-specific functionalities. Of course compression could not be absent from the list. So the Binary Format for Scenes (BiFS), specified in MPEG-4 Part 11 Scene description and application engine was born to allow for efficient representation of dynamic and interactive presentations, comprising 2D & 3D graphics, images, text and audiovisual material. The representation of such a presentation includes the description of the spatial and temporal organisation of the different scene components as well as user-interaction and animations.

In MPEG-I, scene description is again playing an important role. However, this time MPEG does not even intend to pick a scene description technology. It will instead define an interface to scene description parameters.

Font

Many thousands of fonts are available today for use as components of multimedia content. Content often utilises custom-designed fonts that may not be available on a remote terminal. In order to ensure the faithful appearance and layout of content, the font data has to be embedded with the text objects as part of the multimedia presentation.

MPEG-4 part 18 Font Compression and Streaming defines and provides two main technologies:

  • OpenType and TrueType font formats
  • Font data transport mechanism – the extensible font stream format, signaling and identification

Multimedia

Multimedia is a combination of multiple media in some form. Probably the closest multimedia “thing” in MPEG is the standard called Multimedia Application Formats (MPEG-A). However, MPEG-A is an integrated package of media for specific applications and does not define any specific media format. It only specifies how you can combine MPEG (and sometimes other) formats.

MPEG-7 part 5 Multimedia Description Schemes (MDS) specifies the description tools that are neither visual nor audio, i.e. the generic and multimedia ones. Comprising a large number of MPEG-7 description tools built on the basic audio and visual structures, MDS enables the creation of the structure of the description, the description of collections and user preferences, and the hooks for adding the audio and visual description tools. This is depicted in Figure 3.

Figure 3 – The different functional groups of MDS description tools

Neural Networks

Requirements for neural network compression have been presented in Moving intelligence around. After 18 months of intense preparation – development of requirements, identification of test material, definition of a test methodology and drafting of a Call for Proposals (CfP) – MPEG analysed nine technologies submitted by industry leaders at the March 2019 (126th) meeting. The technologies proposed compress neural network parameters to reduce their size for transmission, while not or only moderately reducing their performance in specific multimedia applications. MPEG-7 Part 17 Neural Network Compression for Multimedia Description and Analysis is the standard, part and title given to the new work.
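To give a flavour of the trade-off involved (smaller parameters, little loss of performance), the toy sketch below applies uniform 8-bit quantisation to a set of 32-bit weights. This is not the technology standardised in MPEG-7 Part 17, just an illustration of the principle of trading parameter precision for size.

```python
import numpy as np

# Toy illustration of the size/precision trade-off behind neural network
# compression: uniform 8-bit quantisation of 32-bit weights. This is NOT
# the technology standardised in MPEG-7 Part 17, just the basic principle.
def quantise(weights: np.ndarray, bits: int = 8):
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (2 ** bits - 1)
    q = np.round((weights - lo) / step).astype(np.uint8)
    return q, lo, step

def dequantise(q, lo, step):
    return q.astype(np.float32) * step + lo

w = np.random.randn(1_000_000).astype(np.float32)   # 4 MB of weights
q, lo, step = quantise(w)                            # 1 MB after quantisation
err = np.abs(w - dequantise(q, lo, step)).max()
print(f"compression 4:1, max reconstruction error {err:.4f}")
```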

XML

MPEG-B part 1 Binary MPEG Format for XML (BiM) is the current endpoint of an activity that started some 20 years ago when MPEG-7 Descriptors defined by XML schemas were compressed in a standard fashion by MPEG-7 Part 1 Systems. Subsequently MPEG-21 needed XML compression and the technology was extended in Part 15 Binary Format.

In order to reach high compression efficiency, BiM relies on schema knowledge shared between encoder and decoder. It also provides fragmentation mechanisms for transmission and processing flexibility, and defines means to compile and transmit schema knowledge information so that XML documents can be decompressed even when the receiving end has no a priori schema knowledge.
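The gain of schema-aware coding comes from the fact that encoder and decoder share the schema, so element names and structure need not be transmitted literally. The toy sketch below conveys the idea with a hypothetical four-element schema; it is not the BiM syntax.

```python
# Toy illustration of schema-aware binary XML coding: because encoder and
# decoder share the schema, element names can be sent as small integer
# codes instead of text. This is NOT the BiM syntax, just the principle.
SCHEMA = ["Creator", "Title", "Keyword", "Abstract"]          # shared schema
CODE = {name: i for i, name in enumerate(SCHEMA)}             # name -> code

def encode(elements):
    """elements: list of (element_name, text_value) pairs."""
    out = bytearray()
    for name, value in elements:
        data = value.encode("utf-8")
        out.append(CODE[name])          # 1 byte instead of "<Creator>" etc.
        out.append(len(data))           # toy length field (values < 256 bytes)
        out += data
    return bytes(out)

def decode(blob):
    elements, i = [], 0
    while i < len(blob):
        name = SCHEMA[blob[i]]
        length = blob[i + 1]
        elements.append((name, blob[i + 2:i + 2 + length].decode("utf-8")))
        i += 2 + length
    return elements

payload = encode([("Creator", "MPEG"), ("Title", "BiM in a nutshell")])
assert decode(payload) == [("Creator", "MPEG"), ("Title", "BiM in a nutshell")]
```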

Genome

Genome is digital, and can be compressed presents the technology used in MPEG-G Genomic Information Representation. Many established compression technologies developed for other MPEG media have found good use in genome compression. MPEG is currently busy developing the MPEG-G reference software and is investigating other genomic areas where compression is needed. More concretely, MPEG plans to issue a Call for Proposals for Compression of Genome Annotation at its July 2019 (127th) meeting.

Point Clouds

3D point clouds can be captured with multiple cameras and depth sensors, with points that can number from a few thousand up to a few billion, and with attributes such as colour, material properties etc.

MPEG is developing two different standards whose choice depends on whether the point cloud is dense (this is done in MPEG-I Part 5 Video-based Point Cloud Compression) or less so (MPEG-I Part 9 Graphic-based PCC). The algorithms in both standards are lossy, scalable, progressive and support random access to subsets of the point cloud.

MPEG plans to release Video-based PCC as FDIS in October 2019 and Graphic-based PCC as FDIS in April 2020.

Sensors/Actuators

MPEG felt the need to address compression of data from sensors and data to actuators when it considered the exchange of information taking place between the physical world where the user is located and any sort of virtual world generated by MPEG media.

So MPEG undertook the task to provide standard interactivity technologies that allow a user to

  • Map their real-world sensor and actuator context to a virtual-world sensor and actuator context, and vice-versa, and
  • Achieve communication between virtual worlds.

Figure 4 describes the context of the MPEG-V Media context and control standard.

Figure 4 – Communication between real and virtual worlds

The MPEG-V standard defines several data types and their compression:

  • Part 2 – Control information specifies control device interoperability (actuators and sensors) in real and virtual worlds
  • Part 3 – Sensory information specifies the XML Schema-based Sensory Effect Description Language to describe actuator commands such as light, wind, fog, vibration, etc. that trigger human senses
  • Part 4 – Virtual world object characteristics defines a base type of attributes and characteristics of the virtual world objects shared by avatars and generic virtual objects
  • Part 5 – Data formats for interaction devices specifies syntax and semantics of data formats for interaction devices – Actuator Commands and Sensed Information – required to achieve interoperability in controlling interaction devices (actuators) and in sensing information from interaction devices (sensors) in real and virtual worlds
  • Part 6 – Common types and tools specifies syntax and semantics of data types and tools used across MPEG-V parts.

MPEG-IoMT Internet of Media Things is MPEG’s mapping of the general IoT context to media. MPEG-IoMT Part 3 – IoMT Media Data Formats and API also addresses the issue of media-based sensor and actuator data compression.

What is next in data compression?

In Compression standards for the data industries I reported the proposal made by the Italian ISO member body to establish a Technical Committee on Data Compression Technologies. The proposal was rejected on the grounds that Data Compression is part of Information Technology.

It was a big mistake, because it has stopped the coordinated development of standards that would have fostered the move of different industries to the digital world. The article identified a few such industries: Automotive, Industry Automation, Geographic Information and more.

MPEG has done some exploratory work and found that quite a few of its existing standards could be extended to serve new application areas. One example is the conversion of MPEG-21 Contracts to Smart Contracts. An area of potential interest is data generated by machine tools in industry automation.

Conclusions

MPEG audio and video compression standards are the staples of the media industry. MPEG continues to develop those standards while investigating compression of other data types, in order to be ready with standards when the market matures. Point clouds and DNA reads from high-speed sequencing machines are just two examples of how, by anticipating market needs, MPEG prepares to serve the industry with timely compression standards.


More video with more features

In Forty years of video coding and counting I presented a short but intense history of ITU and MPEG video compression standards. In this article I will focus on how more functionalities have been added to MPEG video compression standards over the years, and how the next generation of standards will add even more.

The figure below gives an overview of all MPEG video compression standards – past, present and planned. Those in italics have not reached Final Draft International Standard (FDIS) level.

Figure 1 – Video coding standards and functionalities

In 1988 MPEG started its first video coding project, for interactive video applications on compact disc (MPEG-1). Input video was assumed to be progressive (25/29.97 Hz, but more frame rates were also supported) and the spatial resolution was Source Image Format (SIF), i.e. 240 or 288 lines x 352 pixels. The syntax supported spatial resolutions up to 16 Kpixels. Obviously, progressive scanning is a feature that all MPEG video coding standards have supported since MPEG-1. The (obvious) exception is point clouds, because there are no “frames”.

In 1990 MPEG started its second video coding project, targeting digital television (MPEG-2). The input was therefore assumed to be interlaced (frame rate of 50/59.94 Hz, but more frame rates were also supported) and the spatial resolution was standard/high definition, and up. The resolution space was quantised by means of levels, the second dimension after profiles. MPEG-4 Visual and AVC are the last two standards with specific interlace tools. An attempt was made to introduce interlace tools in HEVC, but the technologies presented did not show appreciable improvements when compared with progressive tools. HEVC does have some indicators (SEI/VUI) to tell the decoder that the video is interlaced.

MPEG-2 was the first standard to tackle scalability (High Profile), multiview (Multiview Profile) and higher chroma resolution (4:2:2 Profile). Several subsequent video coding standards (MPEG-4 Visual, AVC and HEVC) also support these features. VVC is expected to do the same, though probably not in version 1.

MPEG-4 Visual supports coding of video objects and error resilience. The first feature has remained specific to MPEG-4 Visual. Most video codecs allow for some error resilience (e.g. starting from slices in MPEG-1). However, MPEG-4 Visual – mobile communication being one relevant use case – was the first to specifically consider error resilience as a tool.

MPEG-2 first tried to develop 10-bit support and the empty part 8 is what is left of that attempt.

Wide Colour Gamut (WCG), High Dynamic Range (HDR) and 3 Degrees of Freedom (3DoF) are all supported by AVC. These functionalities were first introduced in HEVC, later added to AVC, and are planned to be supported in VVC as well. WCG makes it possible to display a wider gamut of colours, HDR to display pictures with brighter regions and more visible detail in dark areas, SCC to achieve better compression of non-natural (synthetic) material such as characters and graphics, and 3DoF (also called Video 360) to represent pictures projected on a sphere.

AVC supports more than 8 quantisation bits, extended to 14 bits. HEVC even supports 16 bits. VVC, EVC and LCEVC are also expected to support more than 8 quantisation bits.

WebVC was the first MPEG attempt at defining a video coding standard that would not require a licence involving payment of fees (Option 1 in ISO language; the legal language is more complex than this). Strictly speaking, WebVC is not a new standard, because MPEG simply extracted what was the Constrained Baseline Profile in AVC (originally, AVC tried to define an Option 1 profile but did not achieve the goal and did not define the profile) in the hope that WebVC could achieve Option 1 status. The attempt failed because some companies confirmed the Option 2 patent declarations (i.e. a licence is required to use the standard) they had already made against the AVC standard. The brackets in the figure convey this fact.

Video Coding for Browsers (VCB) is the result of a proposal made by a company in response to an MPEG Call for Proposals for Option 1 video coding technology. Another company made an Option 3 patent declaration (i.e. unavailability to license the technology). As the declaration did not contain any detail that could allow MPEG to remove the allegedly infringing technologies, ISO did not publish VCB as a standard. The square brackets in the figure convey this fact.

Internet Video Coding (IVC) is the third video coding standard intended to be Option 1. Three Option 2 patent declarations were received and MPEG has declared its availability to remove the patented technology from the standard if specific technology claims are made. The brackets convey this fact.

Finally, Essential Video Coding (EVC), part 1 of MPEG-5 (the project has not been formally approved by ISO yet), is expected to be a two-layer video coding standard. The EVC Call for Proposals requested that the technologies provided in response to the Call for the first (lower) layer of the standard be Option 1; technologies for the second (higher) layer may be Option 2. The curly brackets in the figure convey this fact.

Screen Content Coding (SCC) allows better compression of non-natural (synthetic) material such as characters and graphics. It is supported by HEVC and is planned to be supported in VVC and possibly EVC.

Low Complexity Enhancement Video Coding (LCEVC) is another two-layer video coding standard. Unlike EVC, however, in LCEVC the lower layer is not tied to any specific technology and can be any video codec; the goal of the second layer is to extend the capability of an existing video codec. A typical usage scenario is to give a large number of already deployed standard-definition set top boxes, which cannot be recalled, the ability to decode high-definition pictures. The LCEVC decoder is depicted in Figure 2.

Figure 2 – Low Complexity Enhancement Video Coding
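A conceptual sketch of the two-layer decoding flow is given below. The function names are placeholders (any base codec could sit behind decode_base) and the upsampling and residual handling are greatly simplified with respect to the actual LCEVC design.

```python
import numpy as np

# Conceptual sketch of a two-layer (LCEVC-style) decoder: a base codec
# reconstructs a low-resolution picture, which is upsampled and corrected
# with enhancement residuals. Function names are placeholders; the real
# LCEVC toolset is more elaborate than this.
def decode_base(base_bitstream):
    """Stand-in for any existing codec (AVC, HEVC, ...)."""
    return base_bitstream                      # assume it yields a 2D array

def upsample2x(picture):
    return picture.repeat(2, axis=0).repeat(2, axis=1)   # nearest neighbour

def decode_enhancement(enh_bitstream):
    """Stand-in for decoding the low-complexity residual layer."""
    return enh_bitstream                       # assume it yields residuals

def lcevc_style_decode(base_bitstream, enh_bitstream):
    base = decode_base(base_bitstream)         # e.g. SD picture
    prediction = upsample2x(base)              # e.g. HD-sized prediction
    residual = decode_enhancement(enh_bitstream)
    return np.clip(prediction + residual, 0, 255)

# Toy usage: 2x2 "SD" base picture + 4x4 residuals -> 4x4 "HD" output
sd = np.full((2, 2), 100, dtype=np.int16)
res = np.zeros((4, 4), dtype=np.int16)
print(lcevc_style_decode(sd, res))
```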

Today technologies are available to capture 3D point clouds, typically with multiple cameras and depth sensors, producing up to billions of points for realistically reconstructed scenes. Point clouds can have attributes such as colours, material properties and/or other attributes, and are useful for real-time communications, GIS, CAD and cultural heritage applications. MPEG-I part 5 will specify lossy compression of 3D point clouds employing efficient geometry and attribute compression, scalable/progressive coding, and coding of point cloud sequences captured over time, with support of random access to subsets of the point cloud.

Other technologies capture point clouds, potentially with a low density of points, to allow users to freely navigate in multi-sensory 3D media experiences. Such representations require a large amount of data, not feasible for transmission on today’s networks. MPEG is developing a second, graphics-based PCC standard – as opposed to the previous one, which is video-based – for efficient compression of sparse point clouds.

3DoF+ is a term used by MPEG to indicate a usage scenario where the user can make limited translational movements of the head. In a 3DoF scenario, if the user moves the head too much, an annoying parallax error is felt. In March 2019 MPEG received responses to its Call for Proposals requesting appropriate metadata (see the red blocks in Figure 3) to help the post-processor present the best image based on the viewer’s position, if available, or to synthesise a missing one, if not.

Figure 3 – 3DoF+ use scenario

6DoF indicates a use scenario where the user can freely move in a space and enjoy a 3D virtual experience that matches the one in the real world. Light field refers to new devices that can capture a spatially sampled version of a light field, with both spatial and angular light information, in one shot. The captured data is not only larger than traditional camera images but also different in nature. MPEG is investigating new and compatible compression methods for potential new services.

In 30 years compressed digital video has made a lot of progress, e.g., bigger and brighter pictures with less bitrate and other features. The end point is nowhere in sight.

Thanks to Gary Sullivan and Jens-Rainer Ohm for useful comments.


Matching technology supply with demand

Introduction

There have always been people in need of technology and, most of the time, people ready to provide something in response to the demand. In book XVIII of the Iliad, Thetis, Achilles’s mother, asks Hephaestus – the god of blacksmiths, craftsmen, artisans, sculptors and more – to provide a new armour to her son, who had lost his to Hector. Hephaestus duly complied. Still in the fictional domain, but in more recent years, Agent 007 visits Q Branch to get the latest gadgets for his next spy mission, which are inevitably put to good use in the mission.

Wars have always been times when the need for technologies stretches the ability to supply them. In our, supposedly peaceful, age there is plenty of technology around, but it is often difficult for companies needing a particular technology to find the solution matching their needs and budget.

Supply and demand in standardisation

Standardisation is an interesting case of an entity,  typically non-commercial and non-governmental, needing technologies to make a standard. Often standards organisations, too, need technologies to accomplish their mission. How can they access the needed technologies?

A few decades back, if industry needed a standard for, say, a video cassette recorder, the process was definitely supply-driven: a company that had developed a successful product (call it Sony or JVC) submitted a proposal (call it Betamax or VHS) to a standards committee (call it IEC SC 60B). Too bad if the process produced two standards.

In the second half of the 1980s, ITU SG XV (ITU numbering of that time) started developing the H.261 Recommendation. Experts developed the standard piece by piece in the committee (the so-called Okubo group), acquiring the necessary technologies in a process where the roles of demand and supply were rather blurred. Only participants were entitled to provide their technologies to fulfil the needs of the standard.

A handful of years later, MPEG further innovated technology procurement in a standardisation environment. To get the technologies needed to make a standard, it used a demand-driven tool – MPEG’s Call for Proposals (CfP). Since then, technologies provided by respondents and assessed for relevance by the group are used to 1) create the initial reference model (RM0) and 2) initiate a first round of Core Experiments (CE). CEs result from the agreement among participating experts that there is room for improving the performance of the standard under development by opening a particular area to optimisation. CEs are continued until the available room for optimisation is exhausted. While anybody is entitled to respond to a CfP and contribute technology to RM0, only experts participating in the standardisation project can provide technology for CEs. This, however, is not really a limitation, because the process is open to anybody willing to join a national standards organisation that is a member of ISO.

The MPEG process of standards development has allowed the industry to maintain sustained development and expansion for many years. In fairness, this is not entirely MPEG’s merit. Patent pools have played a role synergistic with MPEG’s by providing (industry) users with the means to practice MPEG standards and IP holders with the means to be remunerated for the use of their IP.

The situation today

The HEVC case has shown that the cooperation of different parties toward the common goal of enabling the use of a standard can no longer be taken for granted (see, e.g., A crisis, the causes and a solution). There are several reasons for this: the increasing number of individual technologies needed to make a high-performance MPEG standard, the increasing number of IP holders, the increasing number of Non-Performing Entities (NPE) as providers of technology, and the increasing number of patent pools standing as independent licence providers for a portion of the essential patents.

I have already made several proposals with the intention of helping MPEG out of this stalemate (see, e.g., Business model based ISO/IEC standards). Here I would like to present an additional idea that extends the MPEG process of standards development (see How does MPEG actually work?).

A new process proposal

A possible implementation of the proposal could run like this (some details could be fine-tuned or changed on a case-by-case basis; a toy simulation of the token mechanics follows the two lists):

  1. An entity (a company, an industry forum or a standards organisation) wishes to develop a specification or a standard
  2. The entity issues a Call for Proposals (CfP), including requirements, requesting proponents to accept the process (as defined in this numbered list) and to commit to RAND licensing of their technologies for the specification or standard
  3. The entity assesses the proposals received
  4. The entity sets aside a certain amount of tokens for the entire standard (e.g. 1,000)
  5. The entity builds a “minimal” Reference Model zero (RM0) using technologies contained in the proposals in a conservative way so as to create ample space for a healthy Core Experiment (CE) process
  6. The entity
    1. Assigns a percentage of tokens to RM0
    2. Establishes the amount of tokens that will be given to the proponent who achieves the highest performance in a CE (e.g. 1 token for each 0.1% improvement)
  7. The entity identifies and publishes CEs at the pace required by the specification or standard
  8. For each CE the entity makes a call containing
    1. A description of the CE
    2. A minimum performance target (say, at least 1% improvement)
    3. A deadline for submission of 1) results and 2) code that proves that the target has been achieved
    4. The maximum level of associated complexity
  9. CE proponents should do due diligence and
    1. Make proposals that contain only their own technologies, or
    2. Ask any third party to join in the response (and accept the conditions of the CfP)
  10. If the tokens are all used and there is still room for optimisation, new tokens are created and token holders have their tokens scaled down so that the total number of tokens is still 1,000
  11. If room for RM optimisation is exhausted but there are still tokens unassigned, token holders have their tokens scaled up so that the total number of tokens is 1,000.

Depending on the nature of the entity (company, industry forum or standards organisation), another entity – which can be the same entity that has managed the process, or a patent pool –

  1. Identifies the IP holders in RM0 and the CEs
  2. Removes technologies whose IP status has not been clarified
  3. After completing steps 1 and 2, pays royalties to IP holders based on the number of tokens they have acquired in the process (i.e. RM0 and CEs)

Merits and limits of the proposal

The proposal achieves the goals of

  1. Associating patent holders with RM0 and CE areas, as opposed to associating them just with the standard as a whole
  2. Enabling the technologies of a CE area to be turned off if their IP status is unclear, and turned on again once the status is clarified

In case the entity developing the specification is a standards organisation, more than one patent pool can develop a licence using the results of the process.

Conclusions

This idea was developed in collaboration with Malvika Rao, the founder of Incentives Research and holder of a PhD from Harvard University, and Don Marti, an open source expert and an advisor at Incentives Research.

Converting the basic concept described above into a workable market design requires further work. There may be opportunities to game the system, and the design must consider issues such as how to attract and retain participation. In addition the design must be tested (e.g., via simulation or usability study) to understand its performance.

Please send comments to Leonardo.

What would MPEG be without Systems?

The most visited articles on this blog – Forty years of video coding and counting and Thirty years of audio coding and counting – prove that MPEG is known for its audio and video coding standards. But I will not tire of saying that MPEG would not be what it has become if the Systems aspects had not been part of most of its standards. This is what I intend to talk about in this article.

It is hard to acknowledge, but MPEG was not the first to deal with the problem of putting together digital audio and video for delivery purposes. In the second half of the 1980’s ITU-T dealt with the problem of handling audio-visual services using the basic ISDN access at 2B+D (2×64 kbit/s) or the primary ISDN access, the first digital streams made possible by ITU-T Recommendations.

Figure 1 depicts the solution specified in ITU Recommendation H.221. Let’s assume that we have 2 B channels at 64 kbit/s (Basic Access ISDN). H.221 creates on each B channel a Frame Structure of 80 bytes, i.e. 640 bits repeating itself 100 times per second. Each bit position in an octet can be considered as an 8 kbit/s sub-channel. The 8th bit in each octet represents the 8th sub-channel, called the Service Channel.

Within the Service Channel, bits 1-8 are used by the Frame Alignment Signal (FAS) and bits 9-16 by the Bit-rate Allocation Signal (BAS). Audio is always carried by the first B channel, e.g. by the first 2 sub-channels, and Video and Data by the other sub-channels (less the bitrate allocated to FAS and BAS).

Figure 1 – ITU Recommendation H.221
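
The arithmetic implied by this frame structure can be reproduced in a few lines. The sketch below only uses the figures quoted above and the example allocation of audio to the first two sub-channels:

```python
# Derived bitrates of the H.221 frame structure described above.
FRAMES_PER_SECOND = 100      # the 80-octet frame repeats 100 times per second
OCTETS_PER_FRAME = 80
B_CHANNEL = OCTETS_PER_FRAME * 8 * FRAMES_PER_SECOND   # 64,000 bit/s
SUB_CHANNEL = B_CHANNEL // 8                           # each bit position: 8,000 bit/s

FAS = 8 * FRAMES_PER_SECOND  # bits 1-8 of the Service Channel -> 800 bit/s
BAS = 8 * FRAMES_PER_SECOND  # bits 9-16 of the Service Channel -> 800 bit/s

# Example allocation on a 2B (Basic Access) connection:
audio = 2 * SUB_CHANNEL                             # first 2 sub-channels: 16 kbit/s
video_and_data = 2 * B_CHANNEL - audio - FAS - BAS  # what is left: 110,400 bit/s
print(SUB_CHANNEL, FAS, BAS, audio, video_and_data)
```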

MPEG-1 Systems

The solution depicted in Figure 1 bears the mark of the transmission part of the telecom industry, which had never been very friendly to packet communication. That is why MPEG, in the late 1980’s, had an opportunity to bring some fresh air into this space. Starting from a blank sheet of paper (at that time MPEG still used paper 😊) MPEG designed a flexible packet-based multiplexer to convey in a single stream compressed audio and video, and clock information in such a way as to enable audio‑video synchronisation (Figure 2).

Figure 2 – MPEG-1 Systems

The MPEG Systems revolution took time to take effect. Indeed, the European Eureka 147 project used MPEG-1 Audio layer 2 but designed a frame-based multiplexer for the Digital Audio Broadcasting service.

MPEG-2 Systems

In the early 1990’s MPEG started working on another blank sheet of paper. MPEG had the experience of MPEG-1 Systems design but the requirements were significantly different. In MPEG-1, audio and video (possibly many of them in the same stream) had a common time base, but the main users of MPEG-2 wanted a system that could deliver a plurality of TV programs, possibly coming from different sources (i.e. with different time bases) and with possibly a lot of metadata related to the programs, not to mention some key business enablers like conditional access information. Moreover, unlike MPEG-1 where it was safe to assume that the bits issuing from a Compact Disc would travel without errors to a demultiplexer, in MPEG-2 it was mandatory to assume that the transmission channel was anything but error-free.

MPEG-2 Transport Stream (TS) provides efficient mechanisms to multiplex multiple audio-visual data streams into one delivery stream. Audio-visual data streams are packetised into small fixed-size packets and interleaved to form a single stream. Information about the multiplexing structure is interleaved with the data packets so that the receiving entity can efficiently identify a specific stream. Sequence numbers help identify missing packets at the receiving end, and timing information is assigned after multiplexing with the assumption that the multiplexed stream will be delivered and played in sequential order.
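
As an illustration of the fixed-size packet idea, the minimal sketch below reads the 4-byte header of a 188-byte TS packet to recover the stream identifier (PID) and the 4-bit continuity counter that plays the role of the “sequence number” mentioned above; adaptation fields and the rest of the syntax are deliberately ignored.

```python
import struct

TS_PACKET_SIZE = 188   # fixed-size packets, as described above
SYNC_BYTE = 0x47

def parse_ts_packet(packet: bytes):
    """Extract the PID (which stream the packet belongs to) and the
    continuity counter (used to spot missing packets at the receiver)."""
    assert len(packet) == TS_PACKET_SIZE and packet[0] == SYNC_BYTE
    header, = struct.unpack(">I", packet[:4])
    pid = (header >> 8) & 0x1FFF        # 13-bit packet identifier
    continuity_counter = header & 0x0F  # 4-bit counter, modulo 16
    return pid, continuity_counter

def find_discontinuities(packets):
    """Report PIDs whose counter does not increase by 1 (mod 16).
    A real demultiplexer would also account for packets without payload."""
    last, gaps = {}, []
    for packet in packets:
        pid, cc = parse_ts_packet(packet)
        if pid in last and cc != (last[pid] + 1) % 16:
            gaps.append(pid)
        last[pid] = cc
    return gaps
```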

MPEG-2 Systems is actually two specifications in one (Figure 3). The Transport Stream (TS) is a fixed-length packet-based transmission system designed to work for digital television distribution on error-prone physical channels, while the Program Stream (PS) is a packet-based multiplexer with many points in common with MPEG-1 Systems. While TS and PS share significant information, moving from one to the other may not be immediate.

Figure 3 – MPEG-2 Systems

MPEG-4 Systems

MPEG-4 gave MPEG the opportunity to experience an epochal transition in data delivery. When MPEG-2 Systems was designed, Asynchronous Transfer Mode (ATM) was high on the agenda of the telecom industry and was considered as the vehicle to transport MPEG-2 TS streams on telecommunication networks. Indeed, the Digital Audio-Visual Council (DAVIC) designed its specifications on that assumption. At that time, however, IP was still unknown to the telecom (at least to the transmission part), broadcast and consumer electronics worlds.

The MPEG-4 Systems work was a completely different story than MPEG-2 Systems. An MPEG-4 Mux (M4Mux) was developed along the lines of MPEG-1 and MPEG-2 Systems, but MPEG had to face an unknown world where many transports were emerging as possible candidates. MPEG was obviously unable to make choices (today, 25 years later, the choice is clear) and developed the notion of Delivery Multimedia Integration Framework (DMIF), where all communications and data transfers between the data source and the terminal were abstracted through a logical API called the DAI (DMIF Application Interface), independent of the transport type (broadcast, network, storage).

MPEG-4 Systems, however, was about more than interfacing with transport and multiplexing. The MPEG-4 model was a 3D space populated with dynamic audio, video and 3D Graphics objects. Binary Format for Scenes (BIFS) was the technology designed to provide the needed functionality.

Figure 4 shows the 4 MPEG-4 layers: Transport, Synchronisation, Compression and Composition.

Figure 4 – MPEG-4 Systems

MPEG-4 File Format

For almost 10 years – until 1997 – MPEG was a group that made intense use of IT tools (in the form of computer programs that simulated the encoding and decoding operation of the standards it was developing) but was not an “IT group”. The proof? Until that time it had not developed a single file format. Today MPEG can claim this attribute (IT group) along with the many others it has.

In those years MP3 files were already being created and exchanged by the millions, but the files did not provide any structure. The MP4 File Format, officially called ISO Base Media File Format (ISO BMFF), filled that gap as it can be used for editing, HTTP streaming and broadcasting.

Let’s have a high-level look to understand the sea that separates MP3 files from the MP4 FF. An MP4 file contains tracks for each media type (audio, video etc.), with additional information: a four-character ‘name’ of the media type, with all parameters needed by the media type decoder. “Track selection data” helps a decoder identify what aspect of a track can be used and determine which alternatives are available.

Data are stored in a basic structure called a box, with attributes of length, type (4 printable characters) and possibly version and flags. No data can be found outside of a box. Figure 5 shows a possible organisation of an MP4 file.

 

Figure 5 – Boxes in an MP4 File
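
The box structure lends itself to a very simple parser. The sketch below prints the box hierarchy of a file; it descends only into a few well-known container boxes, ignores ‘uuid’ boxes and the version/flags of full boxes, and the file name is of course hypothetical.

```python
import struct

CONTAINER_BOXES = (b"moov", b"trak", b"mdia", b"minf", b"stbl")

def walk_boxes(data: bytes, offset: int = 0, end: int = None, depth: int = 0):
    """Walk the boxes of an MP4 file: each box starts with a 32-bit length
    and a 4-character type; container boxes hold further boxes."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:       # a 64-bit size follows the type
            size, = struct.unpack(">Q", data[offset + 8:offset + 16])
            header = 16
        elif size == 0:     # the box extends to the end of the file
            size = end - offset
        if size < header:
            break           # malformed box: stop rather than loop forever
        print("  " * depth + box_type.decode("ascii", "replace"), size)
        if box_type in CONTAINER_BOXES:
            walk_boxes(data, offset + header, offset + size, depth + 1)
        offset += size

# walk_boxes(open("example.mp4", "rb").read())   # hypothetical file name
```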

MP4 FF can store:

  1. Structural and media data information for timed presentations of media data (e.g. audio, video, subtitles);
  2. Un-timed data (e.g. meta-data);
  3. Elementary stream encryption and encryption parameters (CENC);
  4. Media for adaptive streaming (e.g. DASH);
  5. High Efficiency Image Format (HEIF);
  6. Omnidirectional Media Format (OMAF);
  7. Files partially received over lossy links for further processing such as playback or repair (Partial File Format);
  8. Web resources (e.g. HTML, JavaScript, CSS, …).

Save for the first two features, all others were added in the years following 2001 when MP4 FF was approved. The last two are still under development.

MPEG-7 Systems

With MPEG-7, MPEG made the first big departure from media compression and turned its attention to media description, including ways to compress that information. In addition to descriptors for visual, audio and multimedia information, MPEG-7 includes a Systems layer used by an application, say, navigation of a multimedia information repository, to access coded information coming from a delivery layer in the form of coded descriptors (in XML or in BiM, MPEG’s XML compression technology). Figure 6 illustrates the operation of an MPEG-7 Systems decoder.

Figure 6 – MPEG-7 Systems

An MPEG-7 Systems decoder operates in two phases (a minimal sketch of the second phase follows the list):

  1. Initialisation when DecoderInit initialises the decoder by conveying description format information (textual or binary), a list of URIs that identifies schemas, parameters to configure the Fragment Update decoder, and an initial description. The list of URIs is passed to a schema resolver that associates the URIs with schemas to be passed to Fragment Update Decoder.
  2. Main operation, when the Description Stream (composed of Access Units containing fragment updates) is fed to the decoder which processes
    1. Fragment Update Command specifying the update type (i.e., add, replace or delete content or a node, or reset the current description tree);
    2. Fragment Update Context that identifies the data type in a given schema document, and points to the location in the current description tree where the fragment update command applies; and
    3. Fragment Update Payload conveying the coded description fragment to be added to or replaced in the description.
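
A minimal sketch of how such fragment updates might be applied to a description tree is given below; the tree is held as a nested dictionary, and the path and fragment shown are hypothetical illustrations, not MPEG-7 syntax.

```python
def apply_fragment_update(description, command, context, payload=None):
    """command: 'add', 'replace', 'delete' or 'reset';
    context: path (list of keys) locating the node the command applies to;
    payload: the decoded description fragment to add or substitute."""
    if command == "reset":
        description.clear()
        return description
    node = description
    for key in context[:-1]:
        node = node.setdefault(key, {})
    leaf = context[-1]
    if command in ("add", "replace"):
        node[leaf] = payload
    elif command == "delete":
        node.pop(leaf, None)
    return description

# Example: add a (hypothetical) textual annotation fragment.
tree = {}
apply_fragment_update(tree, "add", ["Description", "TextAnnotation"],
                      {"FreeText": "sunset over the sea"})
```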

MPEG-E

MPEG Multimedia Middleware (M3W), also called MPEG-E, is an 8-part standard defining the protocol stack of consumer-oriented multimedia devices, as depicted in Figure 7.

Figure 7 – MPEG Multimedia Middleware (M3W)

The M3W model includes 3 layers:

  1. Applications, not part of the specifications but enabled by the M3W Middleware API;
  2. Middleware consisting of
    1. M3W middleware exposing the M3W Middleware API;
    2. Multimedia platform supporting the M3W Middleware by exposing the M3W Multimedia API;
    3. Support platform providing the means to manage the lifetime of, and interaction with, realisation entities by exposing the M3W Support API (it also enables management of support properties, e.g. resource management, fault management and integrity management);
  3. Computing platform, whose APIs are outside of M3W scope.

MPEG-M

Multimedia service platform technologies (MPEG-M) specifies two main components of a multimedia device, called peer in MPEG-M.

As shown in Figure 8, the first component is a set of APIs: High-Level APIs for applications and Low-Level APIs for network, energy and security.

Figure 8 – High Level and Low Level API

The second component is a middleware, called MXM, that relies on MPEG multimedia technologies.

Figure 9 – The MXM architecture

The Middleware is composed of two types of engine. Technology Engines are used to call functionalities defined by MPEG standards, such as creating or interpreting a licence attached to a content item. Protocol Engines are used to communicate with other peers, e.g. in case a peer does not have a particular Technology Engine that another peer has. For instance, a peer can use a Protocol Engine to call a licence server to get a licence to attach to a multimedia content item. The MPEG-M middleware has the ability to create chains of Technology Engines (Orchestration) or Protocol Engines (Aggregation).

MMT

MPEG Media Transport (MMT) is part 1 of High efficiency coding and media delivery in heterogeneous environments (MPEG-H). It is the solution for the new world of broadcasting where delivery of content can take place over different channels each with different characteristics, e.g. one-way (traditional broadcasting) and two-way (the ever more pervasive broadband network). MMT assumes that the Internet Protocol is common to all channels.

Figure 10 depicts the MMT protocol stack

Figure 10 – The MMT protocol stack

Figure 11 focuses on the MMT Payload, i.e. on the content structure.

Figure 11 – Structure of MMT Payload

The MMT Payload has an onion-like structure (sketched in code after the list):

  1. Media Fragment Unit (MFU), the atomic unit which can be independently decoded;
  2. Media Processing Unit (MPU), the atomic unit for storage and consumption of MMT content (structured according to ISO BMFF), containing one or more MFUs;
  3. MMT Asset, the logical unit for the elementary stream of a multimedia component, e.g. audio, video or data, containing one or more MPUs;
  4. MMT Package, a logical unit of multimedia content such as a broadcasting program, containing one or more MMT Assets, also containing
    1. Composition Information (CI), describing the spatio-temporal relationships among MMT Assets
    2. Delivery Information, describing the network characteristics.
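
This nesting can be captured with a few data classes, as in the sketch below; the field names are illustrative only and are not the normative MMT syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MFU:                      # Media Fragment Unit: independently decodable
    data: bytes

@dataclass
class MPU:                      # Media Processing Unit: ISO BMFF-structured
    mfus: List[MFU] = field(default_factory=list)

@dataclass
class Asset:                    # logical unit for one elementary stream
    media_type: str             # e.g. "audio", "video", "data"
    mpus: List[MPU] = field(default_factory=list)

@dataclass
class Package:                  # e.g. one broadcasting programme
    assets: List[Asset] = field(default_factory=list)
    composition_info: dict = field(default_factory=dict)  # spatio-temporal relations
    delivery_info: dict = field(default_factory=dict)     # network characteristics
```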

MPEG-DASH

Dynamic adaptive streaming over HTTP (DASH) is another MPEG Systems standard that was motivated by the popularity of HTTP streaming and the existence of different protocols used in different streaming platforms, e.g. different manifest and segment formats. By developing the DASH standard for HTTP streaming of multimedia content, MPEG has enabled a standard-based client to stream content from any standard-based server, thereby enabling interoperability between servers and clients of different make.

Figure 12 – DASH model

As depicted in Figure 12, the multimedia content is stored on an HTTP server in two components: 1) Media Presentation Description (MPD) which describes a manifest of the available content, its various alternatives, their URL addresses and other characteristics, and 2) Segments which contain the actual multimedia bitstreams in form of chunks, in single or multiple files.

A typical operation of the system would follow these steps (a rate-adaptation sketch follows the list):

  1. DASH client obtains the MPD;
  2. Parses the MPD;
  3. Gets information on several parameters, e.g. program timing, media content availability, media types, resolutions, min/max bandwidths, existence of alternatives of multimedia components, accessibility features and the required protection mechanism, the location of each media component on the network and other content characteristics;
  4. Selects the appropriate encoded alternative and starts streaming the content by fetching the segments using HTTP GET requests;
  5. Fetches the subsequent segments after appropriate buffering to allow for network throughput variations
  6. Monitors the network bandwidth fluctuations;
  7. Decides how to adapt to the available bandwidth depending on its measurements by fetching segments of different alternatives (with lower or higher bitrate) to maintain an adequate buffer.
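
A toy version of the adaptation loop in steps 4-7 might look as follows; fetch_segment stands for a hypothetical HTTP GET of one segment, and buffering and playback are omitted.

```python
import time

def stream(mpd_bitrates_bps, fetch_segment, safety=0.8):
    """Toy DASH adaptation loop: pick the highest advertised bitrate that the
    measured throughput allows, fetch the segment, re-measure, repeat.

    mpd_bitrates_bps : bitrates of the alternatives listed in the MPD
    fetch_segment    : callable(bitrate, index) -> bytes (one HTTP GET)
    """
    throughput = min(mpd_bitrates_bps)           # start conservatively
    index = 0
    while True:
        candidates = [b for b in mpd_bitrates_bps if b <= safety * throughput]
        bitrate = max(candidates) if candidates else min(mpd_bitrates_bps)
        t0 = time.monotonic()
        segment = fetch_segment(bitrate, index)  # step 5: HTTP GET of a segment
        if not segment:
            break                                # no more segments
        elapsed = max(time.monotonic() - t0, 1e-6)
        throughput = len(segment) * 8 / elapsed  # step 6: monitor the bandwidth
        index += 1                               # step 7 happens at the next pick
```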

DASH only defines the MPD and the segment formats. MPD delivery and media encoding formats containing the segments as well as client behavior for fetching, adaptation heuristics and content playing are outside of MPEG-DASH’s scope.

Conclusions

The reader should not think that this is an exhaustive presentation of MPEG’s Systems work. I hope the description will reveal the amount of work that MPEG has invested in Systems aspects, sometimes per se, and sometimes to provide adequate support to users of its media coding standards. This article also describes some of the most successful MPEG standards. At the top certainly towers MPEG-2 Systems of which 9 editions have been produced to keep up with continuous user demands for new functionalities.

Without mentioning the fact that MPEG-2 Systems has received an Emmy Award 😉.

MPEG: what it did, is doing, will do

Introduction

If I exchange words with taxi drivers in a city somewhere in the world, one of the questions I am usually asked is: “where are you from?”. As I do not like straight answers, I usually ask back “where do you think I am from?” It usually takes time before the driver gets the information he asked for. Then the next question is: “what is your job?”. Again, instead of giving a straight answer, I ask the question: “do you know MPEG?” Well, believe it or not, 9 out of 10 times the answer is “yes”, often supplemented by an explanation decently connected with what MPEG is.

Wow! Do we need a more convincing proof that MPEG has conquered the minds of the people of the world?

The interesting side of the story, though, is that, even if the name MPEG is known by billions of people, it is not a trademark. Officially, the word MPEG does not even exist. When talking to ISO you should say “ISO/IEC JTC 1/SC 29/WG 11” (next time, ask your taxi driver if they know this letter soup). The last insult is that the mpeg.org domain is owned by somebody who just keeps it without using it.

Should all this be of concern? Maybe for some, but not for me. What I have just talked about is just one aspect of what MPEG has always been. Do you think that MPEG was the result of high-level committees made of luminaries advising governments to take action on the future of media? You are going to be disappointed. MPEG was born haphazardly (read here, if you want to know how). Its strength is that it has been driven by the idea that the epochal transition from analogue to digital should not become another PAL-SECAM-NTSC or VHS-Betamax trap.

In 30 years MPEG has grown 20-fold, changed the way companies do business with media, made music liquid, multiplied the size of TV screens, brought media where there were stamp-size displays, made internet the primary delivery for media, created new experiences, shown that its technologies can successfully be applied beyond media…

There is no sign that its original driving force is abating, unless… Read until the end if you want to know more.

What did MPEG do?

MPEG-1 & MPEG-2

MPEG was the first standards group that brought digital media to the masses. In the 2nd half of the 1990’s the MPEG-1 and MPEG-2 standards were converted to products and services as the list below will show (not that the use of MPEG-1 and MPEG-2 is confined to the 1990’s).

  • Digital Audio Broadcasting: in 1995, just 3 years after MPEG-1 was approved, DAB services began to appear in Europe with DAB receivers becoming available some time later.
  • Portable music: in 1997, 5 years after MPEG-1 was approved, Saehan Information Systems launched MPMan, probably the first portable digital audio player for the mass market that used MP3. This was followed by a long list of competing players until the mobile handset largely took over that function.
  • Video CD: in the second half of the 1990’s VCD spread especially in South East Asia until the MPEG-2 based DVD, with its superior quality, slowly replaced it. It uses all 3 parts of MPEG-1 (layer 2 for audio).
  • Digital Satellite broadcasting: in June 1994 DirecTV launched its satellite TV broadcasting service for the US market, even before MPEG released the MPEG-2 standard in November of that year! It used MPEG-2 and its lead was followed by many other regions who gradually converted their analogue broadcast services to digital.
  • Digital Cable distribution: in 1992 John Malone launched the “500-channel” vision for future cable services and MPEG gave the cable industry the means to make that vision real.
  • Digital Terrestrial broadcasting:
    • In 1996 the USA Federal Communications Commission adopted the ATSC A/53 standard. It took some time, however, before wide coverage of the country, and of other countries following the ATSC standards, was achieved.
    • In 1998 the UK introduced Digital Terrestrial Television (DTT).
    • In 2003 Japan started DTT services using MPEG-2 AAC for audio in addition to MPEG-2 Video and TS.
    • DTT is not deployed in all countries yet, and there is regularly news of a country switching to digital, the MPEG way of course.
  • Digital Versatile Disc (DVD): toward the end of the 1990’s the first DVD players were put to market. They used MPEG-2 Program Stream (part 1 of MPEG-2) and MPEG-2 Video, and a host of audio formats, some from MPEG.

MPEG-4

In the 1990s the Consumer Electronics industry provided devices to the broadcasting and telecom industries, and devices for package media. The shift to digital services called for the IT industry to join as providers of big servers for broadcasting and interactive services (even though in the 1990’s the latter did not take off). The separate case of portable audio players provided by startups did not fit the established categories.

MPEG-4 played the fundamental role of bringing the IT industry into the MPEG fold as a primary player in the media space.

  • Internet-based audio services: The great original insight of Steve Jobs and other industry leaders transformed Advanced Audio Coding (AAC) from a promising technology to a standard that dominates mobile devices and internet services
  • Internet video: MPEG-4 Visual, nicknamed MP4, did not repeat the success of MP3 for video. Still, it was the first example of digital video on the internet, in the form of DivX (a company name). Its hopes to become the streaming video format for the internet were dashed by the licensing terms of MPEG-4 Visual, the first example of the ill effects of technology rights on an MPEG standard
  • Video for all: MPEG-4 Advanced Video Coding (AVC) became a truly universal standard adopted in all areas and countries. Broadcasting, internet distribution, package media (Blu-ray) and more.
  • Media files: the MP4 File Format is the general structure for time-based media files, which has become another ubiquitous standard at the basis of modern digital media.
  • Advanced text and graphics: the Open Font Format (OFF), based on the OpenType specification, revised and extended by MPEG, is universally used.

MPEG-7

MPEG-A

  • Format for encrypted, adaptable multimedia presentation: is provided by the Common Media Application Format (CMAF), a format optimised for large scale delivery of protected media with a variety of adaptive streaming, broadcast, download, and storage delivery methods including DASH and MMT.
  • Interoperable image format: the Multi-Image Application Format (MIAF) enables precise interoperability points for creating, reading, parsing, and decoding images embedded in HEIF.

MPEG-B

  • Generic binary format for XML: is provided by Binary format for XML (BiM), a standard used by products and services designed to work according to ARIB and DVB specifications.
  • Common encryption for files and streams: is provided by Common Encryption (CENC) defined in two MPEG-B standards – Part 7 for MP4 Files and Part 9 for MPEG-2 Transport Stream. CENC is widely used for the delivery of video to billions of devices capable of accessing internet-delivered stored files, MPEG-2 Transport Streams and live adaptive streaming.

MPEG-H

  • IP-based television: MPEG Media Transport (MMT) is the “transport layer” of IP-based television. MMT assumes that delivery is achieved by an IP network with in-network intelligent caches close to the receiving entities. Caches adaptively packetise and push the content to receiving entities. MMT has been adopted by the ATSC 3.0 standard and is currently being deployed in countries adopting ATSC standards and also used in low-delay streaming applications.
  • More video compression, siempre!: has been provided by High Efficiency Video Coding (HEVC), the AVC successor yielding an improved compression up to 60% compared to AVC. Natively, HEVC supports High Dynamic Range (HDR) and Wider Colour Gamut (WCG). However, its use is plagued by a confused licensing landscape as described, e.g. in A crisis, the causes and a solution
  • Not the ultimate audio experience, but close: MPEG-H 3D Audio is a comprehensive audio compression standard capable of providing very satisfactory immersive audio experiences in broadcast and interactive applications. It is part of the ATSC 3.0 standard.
  • Comprehensive image file format: High Efficiency Image File Format (HEIF) is a file format for individual HEVC-encoded images and sequences of images. It is a container capable of storing HEVC intra-images and constrained HEVC inter-images, together with other data such as audio in a way that is compatible with the MP4 File Format. HEIF is widely used and supported by major OSs and image editing software.

MPEG-DASH

Streaming on the unreliable internet: Dynamic Adaptive Streaming over HTTP (DASH) is the widely used standard that enables a media client connected to a media server via the internet to obtain, instant by instant, the version, among those available on the server, that best suits the momentary network conditions.

What is MPEG doing now?

In the preceding chapter I singled out only MPEG standards that have been (and often still continue to be) extremely successful.

I am unable to single out those that will be successful in the future 😊, so the reasonable thing to do is to show the entire MPEG work plan.

At the risk of making the wrong bet 😊, let me introduce some of the most high-profile standards under development, subdivided into the three categories Media Coding, Systems and Tools, and Beyond Media. But you had better become acquainted with all ongoing activities. In MPEG sometimes the last become the first.

Media Coding

  • Versatile Video Coding (VVC): is the flagship video compression activity that will deliver another round of improved video compression. It is expected to be the platform on which MPEG will build new technologies for immersive visual experiences (see below).
  • Enhanced Video Coding (EVC): is the shorter term project with less ambitious goals. EVC is designed to satisfy urgent needs from those who need a standard with a less complex IP landscape
  • Immersive visual technologies: investigations on technologies applicable to visual information captured by different camera arrangements are under way, as described in The MPEG drive to immersive visual experiences.
  • Point Cloud Compression (PCC): refers to two standards capable of compressing 3D point clouds captured with multiple cameras and depth sensors. The algorithms in both standards are lossy, scalable, progressive and support random access to point cloud subsets. See The MPEG drive to immersive visual experiences for more details.
  • Immersive audio: MPEG-H 3D Audio supports a 3 Degrees of Freedom or 3DoF (yaw, pitch, roll) experience at the movie “sweet spot”. More complete user experiences, however, are needed, i.e. 6 DoF (adding x, y, z). These can be achieved with additional metadata and rendering technology.

Systems and Tools

  • Omnidirectional media format: Omnidirectional Media Application Format (OMAF) v1 is a format supporting the interoperable exchange of omnidirectional (VR 360) content for a user who can only Yaw, Pitch and Roll their head. OMAF v2 will support some head translation movements. See The MPEG drive to immersive visual experiences for more details.
  • Storage of PCC data in MP4 FF: MPEG is developing systems support to enable storage and transport of compressed point clouds with DASH, MMT etc.
  • Scene Description Interface: MPEG is investigating the interface to the scene description (not the technology) to enable rich immersive experiences.
  • Service interface for immersive media: Network-based Media Processing will enable a user to obtain potentially very sophisticated processing functionality from a network service via standard API.
  • IoT when Things are Media Things: Internet of Media Things (IoMT) will enable the creation of networks of intelligent Media Things (i.e. sensors and actuators)

Beyond Media

  • Standards for biotechnology applications: MPEG is finalising all 5 parts of the MPEG-G standard and establishing new liaisons to investigate new opportunities.
  • Coping with neural networks everywhere: shortly (25 March 2019) MPEG will receive responses to its Call for Proposals for Neural Network Compression as described in Moving intelligence around.

What will MPEG do in the future?

At the risk of being considered boastful, I would think that MPEG should have received attention from some of the business schools that study socio-economic phenomena. Why? Because many have talked about media convergence, but they have forgotten that MPEG, with its standards, has actually triggered that convergence. MPEG people know the ecosystem at work in MPEG, and I for one see how unique it is.

This has not happened. Let’s say that it is better to be neglected than to receive unwanted attention.

I would also think that a body that started from a Subcommittee on character sets and has become the reference standards group for the media industry – devices, content, services and applications worth hundreds of billions of USD, with potent influence on a nearby industry such as telecommunication – should have prompted standards organisations to study its working method and possibly apply it to other domains.

This has not happened. Let’s say, again, that it is better to be neglected than to receive unwanted attention.

So can we expect MPEG to continue its mission, and apply its technologies and know-how to continue delivering compression standards for immersive experiences and new compression standards for other domains?

Maybe this time MPEG will attract attention. But don’t count on it.

The MPEG drive to immersive visual experiences

Introduction

In How does MPEG actually work? I described the MPEG process: once an idea is launched, context and objectives of the idea are identified; use cases submitted and analysed; requirements derived from use cases; and technologies proposed, validated for their effectiveness for eventual incorporation into the standard.

Some people complain that MPEG standards contain too many technologies supporting “non-mainstream” use cases. Such complaints are understandable but misplaced. MPEG standards are designed to satisfy the needs of different industries and what is a must for some, may well not be needed by others.

To avoid burdening a significant group of users of the standard with technologies considered irrelevant, from the very beginning MPEG adopted the “profile approach”. This allows MPEG to retain a technology for those who need it without encumbering those who do not.

It is true that there are a few examples where some technologies in an otherwise successful standard go unused. Was adding such technologies a mistake? In hindsight yes, but at the time a standard is developed the future is anybody’s guess, and MPEG does not want to find out later that one of its standards lacks a functionality that was deemed necessary for some use cases and that the technology of the time could have supported.

For sure there is a cost in adding the technology to the standard – and this is borne by the companies proposing the technology – but there is no burden to those who do not need it because they can use another profile.

Examples of such “non-mainstream” technologies are provided by those supporting stereo vision. Since as early as MPEG-2 Video, multiview and/or 3D profile(s) have been present in most MPEG video coding standards. Therefore, this article will review the attempts made by MPEG at developing new and better technologies to support what are called today immersive experiences.

The early days

MPEG-1 did not have big ambitions (but the outcome was not modest at all ;-). MPEG-2 was ambitious because it included scalability – a technology that reached maturity only some 10 years later – and multiview. As depicted in Figure 1, multiview was possible because if you have two close cameras pointing to the same scene, you can exploit intraframe, interframe and interview redundancy.

Figure 1 – Redundancy in multiview

Both MPEG-2 scalability and multiview saw little take up.

Both MPEG-4 Visual and AVC had multiview profiles. AVC had 3D profiles next to multiview profiles. Multiview Video Coding (MVC) of AVC was adopted by the Blu-ray Disc Association, but the rest of the industry took another turn as depicted in Figure 2.

Figure 2 – Frame packing in AVC and HEVC

If the left and right frames of two video streams are packed in one frame, regular AVC compression can be applied to the packed frame. At the decoder, the frames are de-packed after decompression and the two video streams are obtained.

This is a practical but less than optimal solution. Unless the frame size of the codec is doubled, you compromise either the horizontal or the vertical resolution, depending on the frame-packing method used. Because of this, a host of other more sophisticated, but eventually unsuccessful, frame-packing methods have been introduced into the AVC and HEVC standards. The relevant information is carried by Supplemental Enhancement Information (SEI) messages, because the specific frame-packing method used is not normative.
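
For concreteness, a side-by-side packing of a stereo pair could be sketched as below; nearest-neighbour down- and up-sampling is used for brevity, whereas real systems use proper filters and signal the arrangement in an SEI message.

```python
import numpy as np

def pack_side_by_side(left, right):
    """Pack a stereo pair into one frame of the original size: each view
    gives up half of its horizontal resolution (the compromise noted above)."""
    return np.hstack([left[:, ::2], right[:, ::2]])   # keep every other column

def unpack_side_by_side(packed):
    """Decoder side: split the packed frame and stretch each half back."""
    w = packed.shape[1] // 2
    left, right = packed[:, :w], packed[:, w:]
    return np.repeat(left, 2, axis=1), np.repeat(right, 2, axis=1)

left = np.zeros((1080, 1920), dtype=np.uint8)
right = np.ones((1080, 1920), dtype=np.uint8)
packed = pack_side_by_side(left, right)   # still 1080x1920, codable as one frame
```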

The HEVC standard, too, supports 3D vision with tools that efficiently compress depth maps, and exploit the redundancy between video pictures and associated depth maps. Unfortunately use of HEVC for 3D video has also been limited.

MPEG-I

The MPEG-I project – ISO/IEC 23090 Coded representation of immersive media – was launched at a time when the word “immersive” was prominent in many news headings. Figure 3 gives three examples of immersivity where technology challenges increase moving from left to right.


Figure 3 – 3DoF (left), 3DoF+ (centre) and 6DoF (right)

In 3 Degrees of Freedom (3DoF) the user is static but the head can Yaw, Pitch and Roll. In 3DoF+ the user has the added capability of some head movements in the three directions. In 6 Degrees of Freedom (6DoF) the user can freely walk in a 3D space.

Currently there are several activities in MPEG that aim at developing standards that support some form of immersivity. While they had different starting points, they are likely to converge to one or, at least, a cluster of points (hopefully not to a cloud😊).

OMAF

Omnidirectional Media Application Format (OMAF) is not a way to compress immersive video but a storage and delivery format. Its main features are:

  1. Support of several projection formats in addition to the equi-rectangular one
  2. Signalling of metadata for rendering of 360° monoscopic and stereoscopic audio-visual data
  3. Use of MPEG-H video (HEVC) and audio (3D Audio)
  4. Several ways to arrange video pixels to improve compression efficiency
  5. Use of the MP4 File Format to store data
  6. Delivery of OMAF content with MPEG-DASH and MMT.

MPEG released OMAF in 2018 and it is now published as an ISO standard (ISO/IEC 23090-2).

3DoF+

If the current version of OMAF is applied to a 3DoF+ scenario, the user may feel parallax errors that are more annoying the larger the movement of the head.

To address this problem, at the January 2019 meeting MPEG issued a call for proposals requesting appropriate metadata (see the red blocks in Figure 4) to help the Post-processor to present the best image based on the viewer’s position, if available, or to synthesise a missing one, if not.

Figure 4 – 3DoF+ use scenario

The 3DoF+ standard will be added to OMAF which will be published as 2nd edition. Both standards are planned to be completed in October 2020.

VVC

Versatile Video Coding (VVC) is the latest in the line of MPEG video compression standards supporting 3D vision. Currently VVC does not specifically include full-immersion technologies, as it only supports omnidirectional video as in HEVC. However, VVC could not only replace HEVC in Figure 4, but also be the target of other immersive technologies, as will be explained later.

Point Cloud Compression

3D point clouds can be captured with multiple cameras and depth sensors, with points that can number from a few thousand up to a few billion, and with attributes such as colour, material properties etc. MPEG is developing two different standards whose choice depends on whether the points are dense (Video-based PCC) or less so (Graphic-based PCC). The algorithms in both standards are lossy, scalable, progressive and support random access to subsets of the point cloud. See here for an example of a Point Cloud test sequence being used by MPEG for developing the V-PCC standard.

MPEG plans to release Video-based Point Cloud Compression as FDIS in October 2019 and Graphic-based Point Cloud Compression as FDIS in April 2020.

Next to PCC compression MPEG is working on Carriage of Point Cloud Data with the goal to specify how PCC data can be stored in ISOBMFF and transported with DASH, MMT etc.

Other immersive technologies

6DoF

MPEG is carrying out explorations on technologies that enable 6 degrees of freedom (6DoF). The reference diagram for that work is what looks like a minor extension of the 3DoF+ reference model (see Figure 5), but may have huge technology implications.

Figure 5 – 6DoF use scenario

To enable a viewer to freely move in a space and enjoy a 3D virtual experience that matches the one in the real world, we still need some metadata as in 3DoF+ but also additional video compression technologies that could be plugged into the VVC standard.

Light field

The MPEG Video activity is all about standardising efficient technologies that compress digital representations of sampled electromagnetic fields in the visible range captured by digital cameras. Roughly speaking we have 4 types of camera:

  1. Conventional cameras with a 2D array of sensors receiving the projection of a 3D scene
  2. An array of cameras, possibly supplemented by depth maps
  3. Point clouds cameras
  4. Plenoptic cameras, whose sensors capture the intensity of light from the different directions along which the light rays travel to reach the sensor.

Technologically speaking, #4 is an area that has not been shy in promises and is delivering on some of them. However, economic sustainability for companies engaged in developing products for the entertainment market has been a challenge.

MPEG is currently engaged in Exploration Experiments (EE) to check

  1. The coding performance of Multiview Video Data (#2) for 3DoF+ and 6DoF, and Lenslet Video Data (#4) for Light Field
  2. The relative coding performance of Multiview coding and Lenslet coding, both for Lenslet Video Data (#4).

However, MPEG is not engaged in checking the relative coding performance of #2 data and #4 data because there are no #2 and #4 test data for the same scene.

Conclusion

In good(?) old times MPEG could develop video coding standards – from MPEG-1 to VVC – by relying on established input video formats. This somehow continues to be true for Point Clouds as well. On the other hand, Light Field is a different matter because the capture technologies are still evolving and the actual format in which the data are provided has an impact on the actual processing that MPEG applies to reduce the bitrate.

MPEG has bravely picked up the gauntlet and its machine is grinding data to provide answers that will eventually lead to one or more visual compression standards to enable rewarding immersive user experiences.

MPEG is planning a “Workshop on standard coding technologies for immersive visual experiences” in Gothenburg (Sweden) on 10 July 2019. The workshop, open to the industry, will be an opportunity for MPEG to meet its client industries, report on its results and discuss industries’ needs for immersive visual experiences standards.

There is more to say about MPEG standards

Introduction

In Is there a logic in MPEG standards? I described the first steps in MPEG life that look so “easy” now: MPEG-1 (1988) for interactive video and digital audio broadcasting; MPEG-2 (1991) for digital television; MPEG-4 (1993) for digital audio and video on fixed and mobile internet; MPEG-7 (1997) for audio-video-multimedia metadata; MPEG-21 (2000) for trading of digital content. Just these 5 standards, whose starting dates cover 12 years i.e. 40% of MPEG’s life time, include 86 specifications, i.e. 43% of the entire production of MPEG standards.

MPEG-21 was the first to depart from the one and trine nature of MPEG standards: Systems, Video and Audio, and that departure has continued until MPEG-H. Being one and trine is a good qualification for success, but MPEG standards do not have to be one and trine to be successful, as this paper will show.

A bird’s eye view of MPEG standards

The figure below presents a view of all MPEG standards: completed, under development or planned. Yellow indicates that the standard has been dormant for quite some time, light brown indicates that the standard is still active, and white indicates standards that will be illustrated in the future.

MPEG-A

The official title of MPEG-A is Multimedia Application Formats. The idea behind MPEG-A is kind of obvious: we have standards for media elements (audio, video, 3D graphics, metadata etc.), but what should one do to be interoperable when combining different media elements? Therefore MPEG-A is a suite of specifications that define application formats integrating existing MPEG technologies to provide interoperability for specific applications. Unlike the preceding standards that provided generic technologies for specific contexts, the link that unites MPEG-A specifications is the task of combining MPEG and, when necessary, other technologies for specific needs.

An overview of the MPEG-A standard is available here. Some of the 20 MPEG-A specifications are briefly described below:

  1. Part 2 – MPEG music player application format specifies an “extended MP3 format” to enable augmented sound experiences (link)
  2. Part 3 – MPEG photo player application format specifies additional information to a JPEG file to enable augmented photo experiences (link)
  3. Part 4 – Musical slide show application format is a superset of the Music and Photo Player Application Formats enabling slide shows accompanied by music
  4. Part 6 – Professional archival application format specifies a format for carriage of content, metadata and logical structure of stored content and related data protection, integrity, governance, and compression tools (link)
  5. Part 10 – Surveillance application format specifies a format for storage and exchange of surveillance data that includes compressed video and audio, file format and metadata (link)
  6. Part 13 – Augmented reality application format specifies a format to enable consumption of 2D/3D multimedia content including both stored and real time, and both natural and synthetic content (link)
  7. Part 15 – Multimedia Preservation Application Format specifies the Multimedia Preservation Description Information (MPDI) that enables a user to discover, access and deliver multimedia resources (link)
  8. Part 18 – Media Linking Application Format specifies a data format called “bridget”, a link from a media item to another media item that includes source, destination, metadata etc. (link)
  9. Part 19 – Common Media Application Format combines and restricts different technologies to deliver and combine CMAF Media Objects in a flexible way to form multimedia presentations adapted to specific users, devices, and networks (link)
  10. Part 22 – Multi-Image Application Format enables precise interoperability points for creating, reading, parsing, and decoding images embedded in a High Efficiency Image File (HEIF).

MPEG-B

The official title of MPEG-B is MPEG systems technologies. After developing MPEG-1, -2, -4 and -7, MPEG realised that there were specific systems technologies that did not fit naturally into any part 1 of those standards. Thus, after using the letter A in MPEG-A, MPEG decided to use the letter B for this new family of specifications.

MPEG-B is composed of 13 parts, some of which are

  1. Part 1 – Binary MPEG format for XML, also called BiM, specifies a set of generic technologies for encoding XML documents, adding to and integrating the specifications developed in MPEG-7 Part 1 and MPEG-21 Part 16 (link)
  2. Part 4 – Codec configuration representation specifies a framework that enables a terminal to build a new video or 3D Graphics decoder by assembling standardised tools expressed in the RVC CAL language (link)
  3. Part 5 – Bitstream Syntax Description Language (BSDL) specifies a language to describe the syntax of a bitstream
  4. Part 7 – Common encryption format for ISO base media file format files specifies elementary stream encryption and encryption parameter storage to enable a single MP4 file to be used on different devices supporting different content protection systems (link)
  5. Part 9 – Common Encryption for MPEG-2 Transport Streams is a specification similar to Part 7, but applying to MPEG-2 Transport Streams
  6. Part 11 – Green metadata specifies metadata to enable a decoder to consume less energy while still providing a good quality video (link)
  7. Part 12 – Sample Variants defines a Sample Variant framework to identify the content protection system used in the client (link)
  8. Part 13 – Media Orchestration contains tools for orchestrating in time (synchronisation) and space the automated combination of multiple media sources (i.e. cameras, microphones) into a coherent multimedia experience rendered on multiple devices simultaneously
  9. Part 14 – Partial File Format enables the description of an MP4 file partially received over lossy communication channels by providing tools to describe reception data, the received data and document transmission information
  10. Part 15 – Carriage of Web Resource in MP4 FF specifies how to use MP4 File Format tools to enrich audio/video content, as well as audio-only content, with synchronised, animated, interactive web data, including overlays.

MPEG-C

The official title of MPEG-C is MPEG video technologies. As for systems, MPEG realised that there were specific video technologies supplemental to video compression that did not fit naturally into any part 2 of the MPEG-1, -2, -4 and -7 standards.

MPEG-C is composed of 6 parts, two of which are

  1. Part 1 – Accuracy requirements for implementation of integer-output 8×8 inverse discrete cosine transform, was created after IEEE had discontinued a similar standard which is at the basis of important video coding standards
  2. Part 4 – Media tool library contains modules called Functional Units expressed in the RVC-CAL language. These can be used to assemble some of the main MPEG Video coding standards, including HEVC and 3D Graphics compression standards, including 3DMC.

MPEG-D

The official title of MPEG-D is MPEG audio technologies. Unlike MPEG-C, MPEG-D parts 1, 2 and 3 actually specify audio codecs that are not generic, as in MPEG-1, MPEG-2 and MPEG-4, but intended to address specific application targets.

MPEG-D is composed of 5 parts, the first 4 of which are

  1. Part 1 – MPEG Surround specifies an extremely efficient method for coding of multi-channel sound via the transmission of 1) a compressed stereo or mono audio program and 2) a low-rate side-information channel, with the advantage of retaining backward compatibility with now ubiquitous stereo playback systems while giving next-generation players the possibility to present a high-quality multi-channel surround experience
  2. Part 2 – Spatial Audio Object Coding (SAOC) specifies an audio coding algorithm capable of efficiently handling individual audio objects (e.g. voices, instruments, ambience, ..) in an audio mix and of allowing the listener to adjust the mix based on their personal taste, e.g. by changing the rendering configuration of the audio scene from stereo over surround to possibly binaural reproduction
  3. Part 3 – Unified Speech and Audio Coding (USAC) specifies an audio coding algorithm capable of providing consistent quality for mixed speech and music content, with a quality that is better than codecs that are optimized for either speech content or music content
  4. Part 4 – Dynamic Range Control (DRC) specifies a unified and flexible format supporting comprehensive dynamic range and loudness control, addressing a wide range of use cases including media streaming and broadcast applications. The DRC metadata attached to the audio content can be applied during playback to enhance the user experience in scenarios such as ‘in a crowded room’ or ‘late at night’

MPEG-E

This standard is the result of an entirely new direction of MPEG standardisation. Starting from the need to define APIs that applications can call to access key MPEG technologies, MPEG issued a Call for Proposals to which several responses were received. MPEG reviewed the responses and developed the ISO/IEC standard called Multimedia Middleware (M3W).

MPEG-E is composed of 8 parts

  1. Part 1 – Architecture specifies the MPEG Multimedia Middleware (M3W) architecture that allows applications to execute multimedia functions without requiring detailed knowledge of the middleware and to update, upgrade and extend the M3W
  2. Part 2 – Multimedia application programming interface (API) specifies the M3W API that provide media functions suitable for products with different capabilities for use in multiple domains
  3. Part 3 – Component model specifies the M3W component model and the support API for instantiating and interacting with components and services
  4. Part 4 – Resource and quality management, Part 5 – Component download, Part 6 – Fault management and Part 7 – System integrity management specify the support API and the technology used for M3W Resource Management, Component Download, Fault Management and Integrity Management, respectively

MPEG-V

The development of the MPEG-V standard Media context and control started in 2006 from the consideration that MPEG media – audio, video, 3D graphics etc. – offer virtual experiences that may be a digital replica of a real world, a digital instance of a virtual world or a combination of natural and virtual worlds. At that time, however, MPEG could not offer users any means to interact with those worlds.

MPEG undertook the task to provide standard interactivity technologies that allow a user to

  1. Map their real-world sensor and actuator context to a virtual-world sensor and actuator context, and vice-versa and
  2. Achieve communication between virtual worlds.

This is depicted in the figure

All data streams indicated are specified in one or more of the 7 MPEG-V parts

  1. Part 1 – Architecture expands on the figure above
  2. Part 2 – Control information specifies control devices interoperability (actuators and sensors) in real and virtual worlds
  3. Part 3 – Sensory information specifies the XML Schema-based Sensory Effect Description Language to describe actuator commands such as light, wind, fog, vibration, etc. that trigger human senses
  4. Part 4 – Virtual world object characteristics defines a base type of attributes and characteristics of the virtual world objects shared by avatars and generic virtual objects
  5. Part 5 – Data formats for interaction devices specifies syntax and semantics of data formats for interaction devices – Actuator Commands and Sensed Information – required to achieve interoperability in controlling interaction devices (actuators) and in sensing information from interaction devices (sensors) in real and virtual worlds
  6. Part 6 – Common types and tools specifies syntax and semantics of data types and tools used across MPEG-V parts.

Conclusion

The standards from MPEG-A to MPEG-V include 59 specifications that extend over the entire 30 years of MPEG activity. These standards account for 29% of the entire production of MPEG standards.

In this period of time MPEG standards have addressed more of the same technologies – systems (MPEG-B), video (MPEG-C) and audio (MPEG-D) – and have covered other features beyond those initially addressed: application formats (MPEG-A), media application life cycle (MPEG-E), and interaction of the real world with virtual worlds, and between virtual worlds (MPEG-V).

Media technologies evolve and so do their applications. Sometimes applications succeed and sometimes fail. So do MPEG standards.

Moving intelligence around

Introduction

Artificial intelligence has reached the attention of the mass media, and the technologies supporting it – Neural Networks (NN) – are being deployed in several contexts affecting end users, e.g. in their smart phones.

If a NN is used locally, it is possible to use existing digital representations of NNs (e.g., NNEF, ONNX). However, these formats miss vital features for distributing intelligence, such as compression, scalability and incremental updates.

To appreciate the need for compression, let’s consider the case of adjusting the automatic mode of a camera based on recognition of a scene/object obtained by using a properly trained NN. As this area is intensely investigated, very soon there will be a new, better trained version of the NN or a new NN with additional features. However, as the process to create the necessary “intelligence” usually takes time and labor (skilled and unskilled), in most cases the newly created intelligence must be moved from the center to where the user handset is. With today’s NNs reaching a size of several hundred Mbytes and growing, a scenario where millions of users clog the network because they are all downloading the latest NN with great new features looks likely.

This article describes some elements of the MPEG work plan to develop one or more standards that enable compression of neural networks. Those wishing to know more please read Use cases and Requirements, and Call for Proposals.

About Neural Networks

A Neural Network is a system composed of connected nodes each of which can

  1. Receive input signals from other nodes,
  2. Process them and
  3. Transmit an output signal to other nodes.

Nodes are typically aggregated into layers, each performing different functions. Typically the “first layers” are rather specific to the signals (audio, video, various forms of text information etc.). Nodes can send signals to subsequent layers but, depending on the type of network, also to the preceding layers.

Training is the process of “teaching” a network to do a particular job, e.g. recognising a particular object or a particular word. This is done by presenting to the NN data from which it can “learn”. Inference is the process of presenting to a trained network new data to get a response about what the new data is.
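
A toy forward pass makes the terminology concrete; the layer sizes and random weights below are arbitrary, and training (the iterative adjustment of the weights from example data) is not shown.

```python
import numpy as np

def layer(inputs, weights, bias):
    """A layer of nodes: each node receives input signals, processes them
    (weighted sum plus a non-linearity) and transmits its output signal."""
    return np.maximum(0.0, inputs @ weights + bias)   # ReLU non-linearity

rng = np.random.default_rng(0)
x = rng.random(8)                                # new data presented to the network
W1, b1 = rng.normal(size=(8, 4)), np.zeros(4)    # "first layer", closer to the signal
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)    # subsequent layer
output = layer(layer(x, W1, b1), W2, b2)         # inference: the network's response
```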

When is NN compression useful?

Compression is useful whenever there is a need to distribute NNs to remotely located devices. Depending on the specific use case, compression should be accompanied by other features. In the following two major use cases will be analysed.

Public surveillance

In 2009 MPEG developed the Surveillance Application Format. This is a standard that specifies the package (file format) containing audio, video and metadata to be transmitted to a surveillance center. Today, however, it is possible to ask the surveillance network to do more intelligent things by distributing intelligence even down to the level of visual and audio sensors.

For these more advanced scenarios MPEG is developing a suite of specifications under the title of Internet of Media Things (IoMT), where Media Things (MThing) are the media “versions” of IoT’s Things. The IoMT standard (ISO/IEC 23093) will reach FDIS level in March 2019.

The IoMT reference model is represented in the figure

IoMT standardises the following interfaces:

1: User commands (setup info.) between a system manager and an MThing

1’: User commands forwarded by an MThing to another MThing, possibly in a modified form (e.g., subset of 1)

2: Sensed data (raw or processed data, either just compressed or resulting from a semantic extraction) and actuation information

2’: Wrapped interface 2 (e.g. for transmission)

3: MThing characteristics, discovery

IoMT is neutral as to the type of semantic extraction or, more generally, as to the nature of the intelligence actually present in the cameras. However, as NNs are demonstrating better and better results in visual pattern recognition tasks such as object detection, object tracking and action recognition, cameras can be equipped with NNs capable of processing the captured information to achieve a level of understanding and transmit that understanding through interface 2.

Therefore, one can imagine that re-trained or brand new NNs are regularly uploaded to a server that distributes them to surveillance cameras. Distribution need not be uniform, since different neural networks may be needed in different areas, depending on the tasks to be carried out there.

NN compression is a vitally important technology to make the described scenarios real, because an automatic surveillance system may use many cameras (thousands or even millions of units) and because, as the technology to create NNs matures, the time between NN updates will become progressively shorter.
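
Purely by way of illustration, the following sketch shows how such an area-specific distribution might look. The names (AREA_MODELS, push_to_camera, the camera list) are hypothetical, and the zlib packaging merely stands in for whatever compressed representation the future standard will define; it is not the IoMT or NNR specification.

```python
# Hypothetical sketch: a server holds different (re-)trained networks for
# different surveillance areas and pushes a compressed package to each camera.
import io
import zlib
import numpy as np

# Each area gets its own set of weights (random placeholders here).
AREA_MODELS = {
    "parking_lot": {"W1": np.random.rand(64, 32), "W2": np.random.rand(32, 4)},
    "entrance":    {"W1": np.random.rand(64, 32), "W2": np.random.rand(32, 8)},
}

def pack_model(weights):
    """Serialise the weight matrices and compress them losslessly."""
    buf = io.BytesIO()
    np.savez(buf, **weights)
    return zlib.compress(buf.getvalue())

def push_to_camera(camera_id, blob):
    # Placeholder for the actual transport to the camera.
    print(f"camera {camera_id}: sending {len(blob)} bytes")

CAMERAS = {"cam-001": "parking_lot", "cam-002": "entrance", "cam-003": "parking_lot"}

for cam, area in CAMERAS.items():
    push_to_camera(cam, pack_model(AREA_MODELS[area]))
```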

Distribution of NN-based apps to devices

There are many cases where compression is useful to efficiently distribute heavy NN-based apps to a large number of devices, in particular mobile ones. Here three cases are considered.

  1. Visual apps. Updating an NN-based camera app in one’s mobile handset will soon become commonplace. The same holds for the many conceivable applications where the smartphone understands some of the objects in the world around it. Both will happen with increasing frequency.
  2. Machine translation (speech-to-text, translation, text-to-speech). NN-based translation apps already exist and their number, efficiency and language support can only increase.
  3. Adaptive streaming. As AI-based methods can improve the QoE, the coded representation of an NN can initially be made available to clients prior to streaming, while updates can be sent during streaming to enable better adaptation decisions, i.e. better QoE (see the sketch after this list).
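
The sketch below illustrates the incremental-update idea behind the third case with a generic thresholded-delta scheme: the base network is delivered once, and afterwards only small weight changes are shipped. The names and the scheme itself are invented for illustration; they are not the mechanism MPEG will eventually standardise.

```python
# Generic illustration of incremental updates: send only the weight changes
# that exceed a threshold, and apply them to the locally stored base model.
import numpy as np

base = {"W": np.random.rand(128, 64).astype(np.float32)}

def make_delta(old, new, threshold=1e-3):
    """Keep only the weight changes large enough to matter (a sparse update)."""
    diff = new - old
    idx = np.nonzero(np.abs(diff) > threshold)
    return idx, diff[idx]

def apply_delta(weights, idx, values):
    """Apply the received sparse update in place."""
    weights[idx] += values
    return weights

# The base model was downloaded before streaming started; later, a re-trained
# version appears on the server and only the differences are transmitted.
retrained = base["W"] + np.random.normal(scale=0.01, size=base["W"].shape).astype(np.float32)
idx, values = make_delta(base["W"], retrained)
base["W"] = apply_delta(base["W"], idx, values)

print(f"updated {len(values)} of {base['W'].size} weights")
```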

Requirements

The MPEG Call for Proposals identifies a number of requirements that a compressed neural network should satisfy. Even though not all applications need the support of all requirements, the NN compression algorithm must eventually be able to support all the identified requirements.

  1. Compression shall have a lossless mode, i.e. the performance of the compressed NN is exactly the same as that of the uncompressed NN
  2. Compression shall have a lossy mode, i.e. the performance of the decompressed NN may differ from that of the uncompressed NN, in exchange for higher compression (see the sketch after this list)
  3. Compression shall be scalable, i.e. even if only a subset of the compressed NN is used, a level of performance is still obtained
  4. Compression shall support incremental updates, i.e. as more data are received, the performance of the NN improves
  5. Decompression shall be possible with limited resources, i.e. with limited processing performance and memory
  6. Compression shall be error resilient, i.e. if an error occurs during transmission, the file is not lost
  7. Compression shall be robust to interference, i.e. it shall be possible to detect that the compressed NN has been tampered with
  8. Compression shall be possible even if there is no access to the original training data
  9. Inference shall be possible using the compressed NN
  10. Compression shall support incremental updates from multiple providers to improve the performance of an NN
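
To make requirements 2 and 5 more tangible, here is a minimal sketch of one classic lossy technique, uniform 8-bit quantisation of the weights. It only illustrates the kind of size/accuracy trade-off involved; the actual toolset will emerge from the evaluation of the CfP responses.

```python
# Lossy compression of NN weights by uniform 8-bit quantisation.
import numpy as np

def quantise(w, bits=8):
    """Map float weights onto a uniform integer grid (lossy, much smaller)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(np.int8), scale

def dequantise(q, scale):
    # Decompression is a single multiply: cheap enough for constrained devices.
    return q.astype(np.float32) * scale

w = np.random.normal(size=(256, 256)).astype(np.float32)  # placeholder weights
q, scale = quantise(w)
w_hat = dequantise(q, scale)

print("bytes before:", w.nbytes, "after (excluding the scale):", q.nbytes)
print("max reconstruction error:", np.abs(w - w_hat).max())
```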

Conclusions

The currently published Call for Proposals does not request technologies for all the requirements listed above (which are themselves a subset of all identified requirements). It is expected, however, that the responses to the CfP will provide enough technology to produce a base-layer standard that will help the industry take its first steps in this exciting field, one that will shape the way intelligence is added to the things near to all of us.
