
Video quality evaluation. Subjective evaluation of video quality Part 2

Although subjective testing has been carried out many times by various organizations, until recently there were no stable, publicly available testing programs designed to run on personal computers.

This was the reason for developing the MSU Perceptual Video Quality tool, which implements various methods for subjective comparison and analysis of results.
A subjective test method combines procedures for presenting the sequences, gathering expert opinions, and processing the results.
Using a video codec comparison as an example, let us consider the test procedure of the SAMVIQ method, recently developed at the EBU (European Broadcasting Union), as implemented in the MSU Perceptual Video Quality tool. This method was used in the subjective comparison of modern video codecs.
SAMVIQ method diagram
Test stages:
1. The expert enters a name (any unique sequence of characters).
2. Color perception test (standard Ishihara charts are used).
3. For each test sequence:
   a. The reference video (original) is shown.
   b. As long as there are unseen compressed versions of this video, the expert selects the next version, watches it, and rates it. The score lies on a scale from 0 to 100, the higher the better. The scores of variants already reviewed can be changed at any time, and any variant can be watched again.
   c. Once all variants of the video have been viewed, the expert can proceed to the next test sequence.
The different variants of the compressed sequence are hidden behind letter designations, so the expert does not know which codec is being evaluated at any given moment. The reference video is available explicitly; it is also hidden under one of the letter designations and is rated on a par with the compressed video streams.
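The blind labeling described above can be illustrated with a short sketch. This is not the code of the MSU tool; the function name `assign_blind_labels`, the label `"reference"`, and the codec names are hypothetical and chosen only for illustration. The sketch shuffles the compressed variants together with a hidden copy of the reference and hands out letters in the shuffled order.

```python
import random
import string


def assign_blind_labels(codecs, seed=None):
    """Map letters A, B, C, ... to the variants in random order.

    The hidden reference is added to the list of compressed variants,
    so it is rated on the same footing as the codecs under test.
    """
    variants = list(codecs) + ["reference"]  # hypothetical label for the hidden original
    if len(variants) > len(string.ascii_uppercase):
        raise ValueError("more variants than available letters")
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    rng.shuffle(variants)
    return dict(zip(string.ascii_uppercase, variants))


# Hypothetical codec names; each expert could get a different seed.
labels = assign_blind_labels(["codec_1", "codec_2", "codec_3"], seed=1)
for letter, variant in labels.items():
    print(letter, "->", variant)
```

Only the test operator would see this mapping; the expert sees the letters alone.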
Why are such complications necessary? Subjective testing techniques have to solve several problems. The first is to establish a common rating scale for all experts, so that a “good” rating means roughly the same thing to each of them. This is achieved through a technique called “anchoring”: during the test the experts are shown both the video with the highest quality (the “high anchor”, which all experts should associate with the maximum score) and the video with the lowest quality (the “low anchor”, which should be associated with the minimum score).
Another task is to minimize the memory effect, that is, the influence of the order in which the videos are shown on the experts' evaluations. In some test methods, this problem is solved by displaying the reference video (original) together with each processed video sequence. In the SAMVIQ method, which we used in the comparison, the first problem is solved by using both a hidden and an explicitly available reference video, and the second by a more flexible evaluation procedure than in other methods (an expert can watch any video again and change the scores already given).
With any test method, subjective test results can be influenced by many external factors. It is essential that all experts are instructed on how to take the test, that the room is adequately lit, and that the tests do not tire the experts. Anything from the gender of the experts to their professions to the time of testing can affect the results. Interestingly, compared to all other factors, monitor characteristics (resolution, LCD / CRT, etc.) do not significantly affect the results (see M. Pinson, S. Wolf, “The Impact of Monitor Resolution and Type on Subjective Video Quality Testing”, NTIA TM-04-412).
Processing of results
The main results are obtained by simply averaging the experts' scores. The resulting value is called the MOS (Mean Opinion Score). In addition, to characterize the spread of opinions, a confidence interval is usually given (the interval that contains the true mean opinion with a specified probability). There are also techniques for excluding experts whose scores are unstable or differ strongly from the average.
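As a minimal sketch of this processing step, the MOS and an approximate 95% confidence interval can be computed as follows. The function name `mos_with_ci` and the scores are hypothetical, and the interval uses the normal approximation (z = 1.96); for a small panel of experts a Student's t quantile would give a somewhat wider interval.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (MOS, (low, high)) for one video variant.

    scores: ratings on the 0-100 scale, one per expert.
    z: normal quantile; 1.96 corresponds to roughly 95% (an assumption,
       not a value prescribed by the article).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed for an interval")
    mos = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample standard deviation
    half_width = z * sd / math.sqrt(n)   # half-width of the interval
    return mos, (mos - half_width, mos + half_width)


# Hypothetical scores given by five experts to one compressed variant.
mos, (low, high) = mos_with_ci([72, 65, 80, 70, 68])
print(f"MOS = {mos:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

If the intervals of two codecs overlap substantially, the difference between their MOS values should be interpreted with caution.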
At the end of 2005, our laboratory carried out subjective tests of video codecs. The goals of the tests were a subjective comparison of new versions of popular codecs, a comparison of the results with objective metric data, and the development of subjective testing methodology. This article contains only a part of the results obtained.



