
MPEG: questions and answers

What is MPEG?
MPEG (Moving Picture Experts Group) is a group of people who meet under the ISO (International Organization for Standardization) to develop standards for digital video and audio compression. Specifically, they define a compressed bit stream, which implicitly defines a decompressor for it. The compression algorithms themselves are left to each manufacturer, and that is where a proprietary advantage can be gained within a published International Standard. The MPEG group meets approximately four times a year for about a week each time. Most of the work is done between meetings; the meetings serve to organize and plan it.
Does it have something to do with JPEG?
Yes, the names sound alike, and the two groups belong to the same ISO subcommittee, along with JBIG and MHEG, and they meet at the same place and time. However, they are different people with different goals and requirements. JPEG is for compressing still images, while MPEG is for moving pictures and the accompanying audio.
What are JBIG and MHEG?
JBIG is for the compression of binary images (such as faxes), and MHEG is for multimedia data that integrates still images, video, audio, text, etc.
How does MPEG video work?
Each color image in the sequence to be compressed is converted to the YUV (YCbCr) color space. The Y component represents intensity (luminance), and the U and V components represent color (chrominance). Since the human eye is less sensitive to chrominance than to intensity, the resolution of the color components can be halved horizontally, or both horizontally and vertically. For animation and high-quality studio video, this downsampling is not applied, in order to preserve quality. For home applications, where bit rates are lower and equipment is cheaper, it causes no noticeable loss in visual perception while saving valuable bits of data.
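A minimal Python sketch of the conversion and downsampling described above. The source gives no coefficients, so the full-range BT.601 weights used here are an assumption for illustration, and the function names are hypothetical. The chroma planes are halved in both directions by averaging each 2×2 group, which is one simple way to downsample and not necessarily what a particular encoder does.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (assumed full-range BT.601 weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 + 0.564 * (b - y)   # blue-difference chroma
    cr = 128 + 0.713 * (r - y)   # red-difference chroma
    return y, cb, cr


def subsample_420(plane):
    """Halve a chroma plane horizontally and vertically.

    `plane` is a list of rows with even width and height; each output
    sample is the mean of one 2x2 group of input samples.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]
```

With both chroma planes reduced this way, an image keeps one full-resolution Y plane plus two quarter-size chroma planes, which is half the number of samples of the original three full planes.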
The basic idea of the whole scheme is to predict motion from one frame to another, and then apply a discrete cosine transform (DCT) to organize the redundancy in the spatial directions. The DCT is performed on 8×8 pixel blocks. Motion prediction is performed in the intensity channel (Y) on 16×16 pixel blocks or, depending on the characteristics of the source sequence (interlaced scan, content), on 16×8 blocks. In other words, for a given 16×16 block of the current frame, the encoder searches for a close match within a correspondingly larger area of a previous or subsequent frame. The DCT coefficients (of either the original data or the difference between the block and its match) are then quantized, that is, divided by a certain number so that the insignificant low-order bits are discarded. After this operation many coefficients turn out to be zero.
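The transform and quantization step can be sketched in a few lines of Python. This is a naive, slow DCT-II with orthonormal scaling, and the quantizer divides every coefficient by the same `step`; both are simplifying assumptions (a real encoder would use a fast transform and per-coefficient quantizer values). The names `dct_2d` and `quantize` are hypothetical.

```python
import math

N = 8  # the DCT operates on 8x8 blocks


def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block, with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency
        for v in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Divide each coefficient by `step` and round, discarding low-order detail."""
    return [[round(value / step) for value in row] for row in coeffs]


# A smooth block: a gentle horizontal ramp from 100 to 107.
block = [[100 + x for x in range(N)] for _ in range(N)]
quantized = quantize(dct_2d(block), step=16)
zeros = sum(row.count(0) for row in quantized)
print(zeros, "of", N * N, "quantized coefficients are zero")
```

For smooth content like this ramp, nearly all of the quantized coefficients should come out as zero, which is what makes the subsequent coding of the block cheap.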
How are the frames related to each other?
There are three types of encoded frames. I-frames are encoded as still images, without reference to any previous or subsequent frame. They are used as starting points. P-frames are predicted from the most recent preceding I- or P-frame. Each macroblock in a P-frame can be coded either as a motion vector plus the DCT coefficients of the difference from the matching block in the last decoded I- or P-frame, or in the same way as in an I-frame if no suitable matching block was found.
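The block search and the choice between predicted and I-style coding for a P-frame macroblock might look like the following sketch. It assumes an exhaustive search scored by the sum of absolute differences (SAD), a common but not mandated matching criterion. The function names, the `search` range, and `intra_threshold` are hypothetical, and the vector is expressed here as the offset from the current block to its match in the reference frame.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def find_motion_vector(cur, ref, cx, cy, size=16, search=7):
    """Try every offset within +/-search pixels; return ((dx, dy), cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best


def code_p_macroblock(cur, ref, cx, cy, size=16, search=7, intra_threshold=2000):
    """Return ("inter", vector) for a good match, else ("intra", None)."""
    vector, cost = find_motion_vector(cur, ref, cx, cy, size, search)
    if cost > intra_threshold:
        return ("intra", None)   # no usable match: code the block like an I-frame block
    return ("inter", vector)     # code the vector plus the DCT of the difference
```

The frames are plain lists of rows holding Y samples. A real encoder would go on to transform and quantize the difference block for the "inter" case.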
Finally, there are the B-frames, which are predicted from the two closest I- or P-frames, one before and one after. Matching blocks are searched for in both of these frames, and three predictions are tried: the forward vector, the backward vector, and the average of the matching macroblocks from the past and the future frames. The best of the three is selected. If none of them works well, the block can be encoded as in an I-frame.
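The three-way choice for a B-frame block can be illustrated with a short sketch. It assumes the matching blocks in the past and future frames have already been found, represents each block as a flat list of samples, and scores candidates by the sum of absolute differences. The names and the `intra_threshold` parameter are hypothetical, and the plain floating-point average stands in for whatever rounding a real codec applies.

```python
def block_error(target, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(t - p) for t, p in zip(target, prediction))


def choose_b_mode(target, past_block, future_block, intra_threshold):
    """Pick the best of forward, backward, and averaged prediction.

    Returns (mode, error). Falls back to "intra" when even the best
    prediction is worse than intra_threshold.
    """
    candidates = {
        "forward": past_block,       # predicted from the earlier I/P frame
        "backward": future_block,    # predicted from the later I/P frame
        "interpolated": [(p + f) / 2 for p, f in zip(past_block, future_block)],
    }
    mode = min(candidates, key=lambda name: block_error(target, candidates[name]))
    error = block_error(target, candidates[mode])
    if error > intra_threshold:
        return ("intra", error)
    return (mode, error)


# A block that lies exactly halfway between its past and future matches:
print(choose_b_mode([10, 20, 30, 40], [0, 10, 20, 30], [20, 30, 40, 50], 16))
# -> ('interpolated', 0.0)
```

Averaging tends to win when an object fades or moves steadily between the two reference frames, which is one reason B-frames compress so well.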
In display order, the sequence of frames generally looks like
IBBPBBPBBPBBIBBPBBPB …
There are 12 frames from one I-frame to the next. This follows from the random access requirement that a starting point must occur every 0.4 seconds, which corresponds to 12 frames at 30 frames per second. The ratio of P-frames to B-frames is based on experience.
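The pattern and the 0.4 second figure can be reproduced with a small sketch. The parameter names `n` (distance between I-frames) and `m` (distance between successive I- or P-frames) are chosen here for illustration, and the 30 frames per second rate is an assumption inferred from 12 frames spanning 0.4 seconds.

```python
def gop_frame_types(n=12, m=3, count=20):
    """Frame types in display order: an I every n frames, an I or P every m."""
    types = []
    for index in range(count):
        position = index % n
        if position == 0:
            types.append("I")
        elif position % m == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)


def random_access_interval(n=12, frames_per_second=30):
    """Seconds between successive I-frames (possible starting points)."""
    return n / frames_per_second


print(gop_frame_types())          # IBBPBBPBBPBBIBBPBBPB
print(random_access_interval())   # 0.4
```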
For the decoder to work, the first P-frame must arrive before the B-frames that precede it in display order, so the compressed sequence looks like this:
0 xx 3 1 2 6 4 5 …
where the numbers are frame numbers in display order. xx is empty at the beginning of a sequence, or stands for B-frames -2 and -1 in the middle of a sequence.
The decoder must first decode the I-frame, then the P-frame, and then, with both in memory, the B-frames. The I-frame is displayed while the P-frame is being decoded, the B-frames are displayed as soon as they are decoded, and the decoded P-frame is displayed while the next P-frame is being decoded.
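The reordering from display order to transmission order can be sketched as follows. The function name is hypothetical, and the handling of trailing B-frames (held back because their future reference has not arrived yet) is a simplification chosen for this sketch.

```python
def display_to_coded_order(types):
    """Map frame types in display order to frame numbers in transmission order.

    Each I- or P-frame is sent first, followed by the B-frames that
    precede it in display order, since those B-frames need it as their
    future reference.
    """
    coded = []
    pending_b = []
    for index, kind in enumerate(types):
        if kind == "B":
            pending_b.append(index)      # wait for the next I or P
        else:
            coded.append(index)          # reference frame goes out first
            coded.extend(pending_b)      # then the B-frames that depend on it
            pending_b = []
    # B-frames still pending here have no future reference yet, so this
    # sketch does not emit them.
    return coded


print(display_to_coded_order("IBBPBBP"))   # [0, 3, 1, 2, 6, 4, 5]
```

The result matches the 0 3 1 2 6 4 5 ordering shown above for the start of a sequence, where xx is empty.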



