TestDevLab’s Approach to Subjective MOS Video Quality Evaluation

According to the International Telecommunication Union (ITU) definition, a subjective score is a value on a predefined scale, given by a subject, that represents their opinion of the media quality. The Mean Opinion Score (MOS) is the average of these scores across multiple subjects.
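For example (using hypothetical values), if five subjects rate the same clip 4, 4.25, 3.75, 4.5, and 4, the MOS for that clip is (4 + 4.25 + 3.75 + 4.5 + 4) / 5 = 4.1.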

MOS has emerged as the most popular indicator of perceived media quality. While these scores are relatively easy to collect and provide valuable insight into user-perceived quality in multimedia experiences, there is a high risk of collecting faulty results if the recommended procedures are not followed.

While ITU standards are focused on specific application areas, e.g. television images, they can be adapted to your specific needs and requirements based on the product under testing.

At TestDevLab, we work with many different clients, covering most video-related products: conferencing, streaming, video on demand, short-form video, and other services that rely on video. We therefore take a personalized approach with each new client, defining project-specific requirements and the scope of the evaluation while understanding the goal and the expected metrics.

Why MOS instead of Objective metrics?

In most cases VMAF, VQTDL, or another objective video quality metric would be more effective, as it is quicker to use, is not affected by the mood or loss of focus of participants, and doesn't require 15 people with prior training. However, there are situations when MOS makes more sense as the main video quality evaluation metric, or can at least be used as a secondary one.

Some of the reasons for our clients to choose MOS as the main video quality metric are:

  1. The need to get scores from real-time video playback straight from the app, recreating the real user experience without any screen recording or video extraction interference.
  2. Most objective video quality evaluation algorithms require a source video for comparison or training, which isn't always available. For example, a live broadcast filmed with a phone's back camera captures a scene that is constantly moving and changing and has no main source file.
  3. It is required to evaluate the quality of content available in the app that was not created specifically for video quality evaluation metrics. While it is subjectively rather easy to say whether video quality has dropped, objective video quality algorithms are more sensitive to blurriness, low-brightness videos, out-of-focus elements, fast-moving objects that create motion blur, and so on. Therefore, when working with random videos that were not filmed with video quality evaluation in mind, Subjective MOS might be the more trustworthy metric.
  4. Subjective MOS doesn't have a rigid, pre-written evaluation method and can be adjusted to the requirements. We can decide what the goal is and shift the focus to different aspects of the video, making results and observations more relevant.

Some of the reasons to choose MOS as a secondary video quality metric are:

  1. Situations when no specific requirements or expected metrics have been decided on yet. MOS can be used in the first testing run to understand the app's behavior and make the first observations, after which we can give advice and recommendations on which approach and metrics to use for future testing. After the MOS session, we will know whether objective video quality metrics alone will be enough, or whether we should add fluidity detection metrics such as freeze, stall, and jitter detection, or additional video quality metrics such as blurriness and color analysis.
  2. When using both objective and subjective video quality evaluation metrics, the results can provide beneficial observations by showing whether the objective metrics correlate with subjective user experience, which can then be used for new conclusions or additional investigations. Every situation can be unique, and we constantly encounter new reasons why MOS benefits the video quality evaluation process.
  3. Objective video quality scores don't always match user stories. For example, your product gets the highest VMAF scores, but you still get feedback that competitors have better quality. Maybe it is time for a MOS evaluation. More often than not, deeper investigation shows that quality is not just one objective number. While for one user great video quality means pixel-level sharpness, for another more saturated colors might matter more than some minor blur on the edges. In situations like these, we carry out a deeper analysis of the differences between the client's and competitors' products, different devices, user accounts, and many other aspects that regular data processing scripts might miss.

In the graph below, we see 30 videos from different short-form video apps evaluated with both VMAF and MOS, arranged in ascending order from the lowest-quality video to the highest. The results show a correlation between the two metrics, with some obvious outliers. These outliers are investigated further, and the reasons for such differences in scores are analyzed.

VMAF vs. MOS correlation
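As a rough illustration of this kind of analysis, the sketch below computes the correlation between the two metrics and flags samples that deviate strongly from the overall trend. It is not our production tooling, and the score arrays are purely hypothetical.

```python
# A minimal sketch of a VMAF/MOS comparison. The score arrays are hypothetical
# and include one deliberately mismatched sample to illustrate outlier flagging.
import numpy as np
from scipy import stats

vmaf = np.array([62, 68, 71, 75, 78, 81, 84, 88, 91, 95], dtype=float)
mos = np.array([3.0, 3.25, 3.5, 3.5, 3.75, 4.0, 3.0, 4.5, 4.25, 4.75])

pearson_r, _ = stats.pearsonr(vmaf, mos)    # linear correlation
spearman_r, _ = stats.spearmanr(vmaf, mos)  # rank (monotonic) correlation

# Flag potential outliers: samples whose MOS deviates strongly from a simple
# linear fit of MOS against VMAF.
slope, intercept = np.polyfit(vmaf, mos, 1)
residuals = mos - (slope * vmaf + intercept)
outliers = np.where(np.abs(residuals) > 2 * residuals.std())[0]

print(f"Pearson r: {pearson_r:.2f}, Spearman rho: {spearman_r:.2f}")
print(f"Samples to investigate further: {outliers.tolist()}")
```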

In the example below, we can see the same setup and the same conditions, but two different apps on two different devices.

The image on the left is more saturated and vivid, but the elements themselves are slightly blurred. The image on the right has duller colors, but the image is noticeably sharper.

In this situation, there is no correct answer as to which example has higher quality. One person might prefer the saturated colors, while another will choose pixel-level sharpness and more detail.

This is where subjective video evaluation is beneficial. Even for the same video, scores might differ a lot between participants, and we will be able to clearly see the reasons and explain the results.

Evaluation Event

Selecting Participants

To collect the most realistic results, we must choose a diverse pool of participants who represent the required target audience. During this step, it has to be decided whether experts or non-experts should evaluate the quality. Experts usually have similar opinions about quality but tend to be more critical of the smallest quality changes. Non-experts might give higher scores, or the range of scores might be wider for the same sample, but their scores might better represent the real user experience.

Evaluation should be executed by enough participants to obtain a statistically significant result. ITU-R Recommendation BT.500-14 recommends a minimum of 15 participants. The number can be adjusted based on the desired confidence interval.
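To give a feel for why the participant count matters, here is a minimal sketch showing how the width of a 95% confidence interval around a MOS value shrinks as more participants are added. The standard deviation used (0.7) is an assumption for illustration, not a measured value.

```python
# Rough sketch: 95% confidence interval half-width vs. number of participants,
# assuming a fixed spread of individual opinion scores (0.7 is hypothetical).
import math
from scipy import stats

assumed_std = 0.7  # assumed standard deviation of individual scores

for n in (5, 10, 15, 24, 40):
    t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95% critical value
    half_width = t_crit * assumed_std / math.sqrt(n)
    print(f"{n:>2} participants -> MOS +/- {half_width:.2f}")
```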

Training

Each participant has to be introduced to the requirements and evaluation standards before the actual evaluation event.

Before evaluating the quality of the sample videos, participants must:

  • See the source video (if available) to understand the highest quality possible;
  • Be introduced to the possible artifacts that can affect video quality;
  • Be introduced to any project-specific behavior that can affect the score;
  • If possible, be shown a wide range of the evaluated content’s possible quality levels and formats. This can be combined with initial training so that participants get used to the process and become comfortable using the whole range of the scoring scale.

Each project can have its own requirements, therefore, participants also have to be introduced to the goal of the evaluation event. A few examples might be:

  • Is the goal to evaluate only image quality without taking into consideration the fluidity of the video, or is the goal to evaluate the full user experience, taking into consideration freezes, buffering, and other fluidity issues?
  • Are specific short-term quality drops taken into consideration when giving an average score for the video quality, or is only the overall quality evaluated? If specific short-term quality drops are not counted in the average video quality score, should the participant note how often a visible quality drop was noticed, so that this information can be used in further result analysis?

Environment Setup

It is highly important to provide the same controlled conditions for all participants.

Collecting video samples for evaluation

We make sure to collect the media samples under similar conditions so that the comparison is apples-to-apples. For example, a video downloaded from the app should not be compared to a video filmed using a mobile device's back camera, as the formats, expected artifacts, color and object depth, and so on will be completely different.

In some cases, exceptions can be made if there is a specific requirement to compare different methods of collecting samples or different formats of the source videos.

Sharing video samples with participants

First, based on project requirements, we decide whether the videos can be shared with the participants so they can access the materials on their own, or whether the materials will only be available as real-time playback in a designated evaluation space. Some options might be:

  • An open portal that provides participants with the videos and a scoring system, accessible from a local machine in any location of the participant's choosing. In this case, we provide participants with detailed environment requirements that must be followed.
  • A folder on a shared server that lets participants access the videos in a location of their choosing. In this case, we make sure participants download the files locally, and we provide detailed environment requirements that must be followed.
  • There might also be situations when participants have to watch videos in real time from the device/app under testing. In this case, we create a laboratory viewing environment and invite participants to the designated evaluation space.

General viewing conditions

All participants have to evaluate videos in the same viewing conditions, as even the smallest changes in screen settings will change the perception of the media.

Examples of requirements:

  • Viewing device - e.g. specific mobile devices, laptop built-in display, specific external monitors, etc.
  • Screen settings - e.g. brightness, color temperature, contrast ratio, resolution, zoom
  • Minimum battery level on the device under testing
  • Network conditions
  • Room illumination
  • Viewing distance and observation angle - should be based on the screen size
  • Media playback tools - e.g. specific browser to access the portal, media player

Additional possible requirements for playback in the apps:

  • Should the cache be cleared after every playback?
  • Is there a cool-down waiting time between playbacks?
  • What user account should be used? E.g. sender/creator/broadcaster watches their own video, or a different user account is chosen as a viewer. Are there any experimental user accounts that could be tested?

Every case can have additional requirements. Organizers of Subjective MOS evaluation events have to go through the whole process to make sure every requirement is covered. Since the participants play the biggest role in producing the results, we make sure they are detail-oriented and question the process, even when provided with highly detailed instructions, as even the smallest changes in the process can affect results.

Evaluation

When all the samples have been collected, participants have been introduced to the process, and the evaluation environment is prepared, it is time to start the Evaluation event.

If possible, the evaluation should be completed within one day, as opinions might change based on a person's mood, energy level, surroundings, and other physical or psychological factors. If the evaluation must continue over a longer period, participants must be given the opportunity to access their previously evaluated materials so they can recall their earlier scoring decisions.
Participants should also be able to take breaks between evaluations to reduce fatigue and regain the ability to focus.

Preparation

Each participant has to be prepared for the evaluation day. They must make sure that there are no distractions during the evaluation period.

Participants must receive the materials in randomized order. If video files are shared, the file names should not contain any information about the tested platform, conditions, quality, or anything else that might bias their subjective opinion.
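As a simple illustration, the sketch below shuffles a folder of sample files and renames them with neutral IDs so the file names reveal nothing about the platform or conditions. The folder names and the mapping file are hypothetical; only the organizer keeps the mapping back to the originals.

```python
# Minimal sketch: randomize presentation order and anonymize sample file names.
import csv
import random
import shutil
from pathlib import Path

source_dir = Path("collected_samples")   # hypothetical input folder
output_dir = Path("evaluation_set")      # folder shared with participants
output_dir.mkdir(exist_ok=True)

videos = sorted(source_dir.glob("*.mp4"))
random.shuffle(videos)                   # randomized presentation order

with open("sample_mapping.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["anonymous_name", "original_name"])
    for i, video in enumerate(videos, start=1):
        anon_name = f"sample_{i:03d}.mp4"          # no platform/condition hints
        shutil.copy(video, output_dir / anon_name)
        writer.writerow([anon_name, video.name])   # mapping kept by organizer only
```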

Evaluation

Each participant proceeds to view the video sample under evaluation and gives it an appropriate score based on requirements and their opinion.

The most popular scale for MOS evaluation is a 5-point scale. Even when 5-point scoring is used, participants should have the option to choose smaller steps between the scores. The most commonly used step is 0.25 points, which gives enough granularity to distinguish between different quality levels.
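As a small illustration (assuming the usual 1 = bad to 5 = excellent labels), the snippet below lists the allowed values on a 5-point scale with 0.25-point steps and checks whether a submitted score falls on that grid.

```python
# Allowed values on a 5-point scale with 0.25-point steps, plus a basic check.
import numpy as np

allowed_scores = np.arange(1.0, 5.0 + 0.25, 0.25)  # 1.00, 1.25, ..., 5.00

def is_valid_score(score: float) -> bool:
    """Return True if the score falls on the 0.25-point grid between 1 and 5."""
    return np.isclose(score, allowed_scores).any()

print(is_valid_score(3.75))  # True
print(is_valid_score(3.6))   # False
```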

When evaluating a big set of data, participants should have the option to change their scores.

The reasons for this are:

  • A participant might choose the wrong score by accident;
  • The participant was presented with an additional set of videos or video quality issues that they had not encountered before, which in turn affected past decisions on the scores;
  • As this is not a test of speed, with bigger sets of data participants might want to go back through the evaluated materials and make sure their scoring was consistent and logical throughout the whole process.

Examples of possible reasons for lower video quality

Here we can see some examples of the most commonly observed video quality issues, but there can always be many more issues that participants should pay attention to.

During evaluation, it is important to pay attention to every detail of the video:

  • Overall video, to see an overall blur, blockiness, color-related artifacts, etc.
  • Separate elements, such as hands or moving objects, to see motion blur, etc.
  • Corners, borders, silhouettes, to see noise, color bleeding, etc.
  • Small details in the background (e.g. light switch, flowers), to see if details are still visible and do not blend together with the background.

Possible aspects affecting image quality:

  • An overall blur of the video
  • Motion blur of specific elements - usually observed on fast-moving objects
  • Blockiness - whole image or separate elements
  • Color bleeding and other color-related artifacts
  • Noise - overall noise, edge noise, etc.

Possible aspects affecting video fluidity:

  • Video freezes and stalls
  • Unstable frame rate - the video jumps backward or forward, frames are skipped or duplicated, or separate elements appear duplicated within a single frame

Examples of color bleeding

Examples of frame issues

Examples of motion blur and overall blurriness

Common issues

  1. There might be a tendency to keep scores centralized, evaluating all the samples within a very narrow range even if the quality differs. This can be due to the inexperience of the participant, a fear of being wrong (although there is no wrong answer in subjective evaluation), or other psychological factors. This issue can, and should, be addressed during the training stage by showing participants examples across the quality range and giving them a chance to score those examples, so they get used to using the full range of the scoring system.
  2. Participants might be tempted to peek at other participants' scores. The reason might be just to make sure someone else has a similar opinion, or, again, fear of being wrong. To avoid this behavior, it is best not to show participants each other's answers and to hide the average scores until the evaluation event is over.
  3. There might be situations when participants can’t be presented with examples of different quality ranges, or even a source video. In this case, it is recommended to invite experienced participants who already have some understanding of what to pay attention to during the evaluation and which artifacts can indicate quality drops.

Presentation of the results

After the evaluation event has ended, the results are collected and the average score is calculated for each sample to obtain the MOS. If needed, individual scores can be reviewed to remove unreliable subjects from the data, but this must be done only in extreme situations, as any score can be a valid representation of a user experience. From the collected MOS values, further analysis is done by identifying trends, issues, areas of improvement, and correlation with other video quality metrics.

A confidence interval of the statistical distribution of the assessment grades should also be given together with the MOS; a 95% confidence interval is typically used. A wide interval can indicate that unreliable participant scores are present.

The report must also include information about the evaluation environment, the evaluated materials, the participant count, and, if any participants were eliminated, both the original and the adjusted scores.

Data analysis

Some of the default metrics we use are:

  • MOS Score: average of the whole value set
  • Confidence interval:
    • Standard deviation: a measure of how dispersed the data is in relation to the mean
    • The lower and the upper limit of the Confidence interval
  • MOS Score with excluded outliers: the average of the data points after excluding values identified as outliers using the interquartile range rule.
  • Correlation with other objective video quality metrics.
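As a condensed sketch of how these metrics could be computed for a single sample (the 15 participant scores below are hypothetical, and the exact tooling varies from project to project):

```python
# Sketch: MOS, standard deviation, 95% confidence interval, and MOS with
# IQR-based outlier exclusion for one sample. Scores are hypothetical.
import numpy as np
from scipy import stats

scores = np.array([4.0, 3.75, 4.25, 4.0, 3.5, 4.5, 4.0, 2.0,
                   4.25, 3.75, 4.0, 4.25, 3.5, 4.0, 4.5])  # 15 participants

mos = scores.mean()              # MOS: average of the whole value set
std_dev = scores.std(ddof=1)     # sample standard deviation

# 95% confidence interval around the MOS
t_crit = stats.t.ppf(0.975, df=len(scores) - 1)
margin = t_crit * std_dev / np.sqrt(len(scores))
ci_lower, ci_upper = mos - margin, mos + margin

# MOS with outliers excluded using the interquartile range (IQR) rule
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
mask = (scores >= q1 - 1.5 * iqr) & (scores <= q3 + 1.5 * iqr)
mos_filtered = scores[mask].mean()

print(f"MOS: {mos:.2f} (95% CI {ci_lower:.2f}-{ci_upper:.2f})")
print(f"MOS without outliers: {mos_filtered:.2f}")
```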

Key takeaways

There is no set-in-stone rule on which video quality evaluation method to use, but having the option to get high-quality results for real user-perceived quality in multimedia experiences is an advantage. Although Subjective MOS evaluation isn't always the first choice for a big set of videos where a quick, precise, and objective result is needed, there are situations when MOS can provide a more realistic and in-depth analysis of video quality.

The main challenge is making sure that participants have the needed experience and that the evaluation event is set up according to the requirements. To overcome this challenge, TestDevLab runs training sessions that prepare participants for any upcoming video quality evaluation events, and provides them with the equipment and environment required for those events.
