As a company that offers video quality testing as one of its services, we at TestDevLab recognized the need for an algorithm that could evaluate image and video quality in a way that closely matches human perception. Various image quality assessment algorithms already existed, such as BRISQUE, but their correlation with human judgment was low, the risk of errors was high, and automated testing was out of the question.
This inspired us to develop our own no-reference algorithm for video quality assessment—VQTDL. In this blog post we will discuss this algorithm in greater detail, explain how it works, compare it to other algorithms, and explore the many opportunities and benefits it brings to video quality testing. Let’s get started.
What is VQTDL?
Video Quality Testing with Deep Learning—or VQTDL—is a no-reference algorithm for video quality assessment. This solution produces image quality predictions that correlate well with human perception and offers good performance under diverse circumstances, such as various network conditions, platforms and applications.
How did we create VQTDL—and why?
Video has become an integral part of our lives. Whether we use communication, collaboration, or streaming applications, video always seems to be at their forefront, even more so after the COVID-19 pandemic changed the way people work, communicate, and stay entertained. In a survey carried out by Gartner, 86% of organizations reported incorporating new virtual technology to interview job candidates. Meanwhile, people using streaming applications have little tolerance for a bad quality stream: according to Tech Radar research, more than half of all users abandon a poor-quality stream within 90 seconds.
This constant exposure and reliance on video—in both professional and personal settings—means it is now more important than ever to make sure video quality meets user expectations. This means no blur, no delays, and no distortions. To achieve this, video quality testing is crucial, if not the only viable solution. However, measuring visual quality is a difficult task in many image and video processing applications. It requires efficient algorithms that provide image and video quality evaluations that closely match human judgment regardless of the type of video content and severity of the distortions that might occur.
This is what inspired us to create a no-reference algorithm for video quality assessment that can effectively test video quality across different use cases.
So, how did we do it?
To create VQTDL, we looked at different deep learning techniques—specifically the Transformer-CNN model with TRIQ architecture—and how they were applied in video communications quality testing. Our aim when developing this solution was to produce image quality predictions that correlate well with human perception and offer good performance under diverse circumstances, such as various network conditions, platforms and applications.
We focused on two main areas when developing and fine-tuning our solution—training and testing. For the training process, we developed two VQTDL models: one to reproduce the problems that could arise from using BRISQUE, and a second to check image quality. To test the efficiency of VQTDL, we looked at various video quality assessment use cases and compared its results against BRISQUE and subjective evaluation scores.
We will look at the training and testing processes in more detail a bit later on. First, let’s look at the VQTDL architecture.
VQTDL architecture
We implemented an architecture based on TRIQ and conducted our experiments using the machine learning framework TensorFlow. VQTDL uses the residual network ResNet-50 as its backbone to define the neural network layers and filters. To handle arbitrary resolutions, we define the positional embedding for the maximal image resolution in the training datasets and truncate it for smaller images. The flattened image patches are then passed to the transformer encoder.
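To make the resolution handling more concrete, here is a minimal TensorFlow sketch of a positional embedding defined for a maximal number of patches and truncated for smaller inputs. The layer and variable names are illustrative and are not taken from the actual VQTDL code.

```python
import tensorflow as tf

class TruncatedPositionalEmbedding(tf.keras.layers.Layer):
    """Learnable positional embedding sized for the largest training
    resolution; for smaller inputs only the first positions are used."""

    def __init__(self, max_patches, embed_dim, **kwargs):
        super().__init__(**kwargs)
        # One learnable vector per patch position at the maximum resolution.
        self.pos_embedding = self.add_weight(
            name="pos_embedding",
            shape=(1, max_patches, embed_dim),
            initializer="random_normal",
            trainable=True,
        )

    def call(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim), where
        # num_patches <= max_patches for smaller images.
        num_patches = tf.shape(patch_embeddings)[1]
        # Truncate the embedding table to the actual number of patches.
        return patch_embeddings + self.pos_embedding[:, :num_patches, :]
```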
In the final step, the multilayer perceptron (MLP) head receives the first output vector, which aggregates information from the transformer encoder about perceived image quality. For the quality distribution, the last fully connected layer has five outputs, one for each quality distribution value, followed by the softmax activation function. Cross-entropy is then used to calculate the distance between the predicted image quality distribution and the ground-truth distribution. A simplified architecture of VQTDL with the transformer encoder is shown below.
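As an illustration of this final step, the sketch below shows an MLP head that maps the first encoder output vector to a five-value quality distribution with softmax, and the cross-entropy loss between the predicted and ground-truth distributions. The layer sizes and example values are placeholders, not the actual VQTDL configuration.

```python
import tensorflow as tf

def build_quality_head(embed_dim=512, hidden_units=256, num_bins=5):
    """MLP head: first encoder output vector -> 5-value quality distribution."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu",
                              input_shape=(embed_dim,)),
        tf.keras.layers.Dropout(0.1),
        # Five outputs, one per quality distribution value, with softmax.
        tf.keras.layers.Dense(num_bins, activation="softmax"),
    ])

# Cross-entropy between the predicted and ground-truth quality distributions.
loss_fn = tf.keras.losses.CategoricalCrossentropy()

# Example: the first output vector from the transformer encoder
# for a batch of two images (values are placeholders).
first_token = tf.random.normal((2, 512))
ground_truth = tf.constant([[0.0, 0.1, 0.2, 0.4, 0.3],
                            [0.5, 0.3, 0.1, 0.1, 0.0]])

head = build_quality_head()
predicted = head(first_token)
loss = loss_fn(ground_truth, predicted)
```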
Experimental results
Like we mentioned a bit earlier, to perfect our video quality assessment algorithm, we needed to run some experiments. Namely, we first needed to train and later test our algorithm to see how it performs in different use cases and in comparison to other image quality assessment models. Here are the results.
Training
For the training process we developed two VQTDL models—16.11.83 and 10.12.59. The VQTDL model 16.11.83 was built using data from half of the old BRISQUE subjective model and four tests from automation setups (∼500 new images), for 3,231 images in total (see Fig. 2). This model was built to reproduce the situations where problems with BRISQUE would arise, such as sensitivity to UI changes, unstable prediction values, and the need for manual fine-tuning to obtain accurate predictions.
The second VQTDL model, 10.12.59, was built using data from a 9-second video recording of the monitor of the sender device, which was then passed to a degradation script that generated 1,696 images in total. It was built to check the quality of the received image, taking into account the artifacts introduced by the camera used to record the monitor screen with the playback video. This is why the value ranges for this model are much higher than for the previous one: the original video is already a degraded version of the video played on the monitor screen.
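For readers who want a feel for this preprocessing step, below is a rough sketch of extracting frames from such a recording and applying a simple degradation (JPEG re-compression at several quality levels). The actual degradation script may apply different distortions; this is only an assumed example.

```python
import cv2

def extract_and_degrade(video_path, out_dir, quality_levels=(90, 70, 50, 30, 10)):
    """Illustrative degradation step: extract frames from the recorded video
    and re-encode each at several JPEG quality levels to create training images.
    The real VQTDL degradation script may use different distortions."""
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        for quality in quality_levels:
            cv2.imwrite(
                f"{out_dir}/frame_{index:05d}_q{quality}.jpg",
                frame,
                [cv2.IMWRITE_JPEG_QUALITY, quality],
            )
        index += 1
    capture.release()
```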
To analyze training results, we used the following metrics: training and validation accuracy, training and validation loss, and three commonly used performance metrics—the Pearson linear correlation coefficient (PLCC), root mean square error (RMSE), and Spearman's rank correlation coefficient (SRCC). Training and validation accuracy and loss allowed us to analyze the overall training process.
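For reference, these three performance metrics can be computed from paired prediction and subjective scores as in the sketch below; the example scores are placeholders, not our data.

```python
import numpy as np
from scipy import stats

def correlation_metrics(predicted, subjective):
    """Compute PLCC, SRCC, and RMSE for paired quality scores."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    plcc, _ = stats.pearsonr(predicted, subjective)   # linear correlation
    srcc, _ = stats.spearmanr(predicted, subjective)  # rank correlation
    rmse = np.sqrt(np.mean((predicted - subjective) ** 2))
    return {"PLCC": plcc, "SRCC": srcc, "RMSE": rmse}

# Example with placeholder scores on a 1-5 MOS scale.
print(correlation_metrics([4.1, 3.2, 2.8, 4.6], [4.0, 3.5, 2.5, 4.8]))
```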
Table 1: Training results
| Model    | Training accuracy | Training loss | PLCC↑  | RMSE↓  | SRCC↑  | Validation accuracy | Validation loss |
|----------|-------------------|---------------|--------|--------|--------|---------------------|-----------------|
| 16.11.83 | 0.9111            | 0.7331        | 0.879  | 0.3861 | 0.863  | 0.7372              | 1.0053          |
| 10.12.59 | 0.8714            | 0.7342        | 0.9858 | 0.2655 | 0.9779 | 0.8547              | 0.7515          |
Testing
We compared VQTDL against BRISQUE and subjective evaluation scores. We only compared our solution to a no-reference image quality assessment model because full-reference quality metrics such as VMAF require reference images that are strictly aligned with the distorted images, and such references do not exist in many real-life applications.
To get an overall understanding of how the VQTDL models perform under various circumstances, we gathered videos from different video call applications currently on the market, covering various network conditions, platforms, and applications. Next, we subjectively evaluated each full-length video (not a single image) and noted the score. Once all the videos had been scored, we ran BRISQUE and VQTDL. We used the same image size for every test to make sure both VQTDL and BRISQUE operated under the same conditions.
On video call streams, the position of the video feed that we want to test often changes depending on the app and platform being tested. This is why we also checked stability against UI changes and misalignments in the cropped areas while performing automatic position detection.
During our tests, we confirmed that VQTDL can handle images with arbitrary resolutions. Its prediction values are more stable and closer to the subjective scores than BRISQUE's, and VQTDL copes better with changes in the UI of the application under test.
The average PLCC results for VQTDL were higher than 96% for all network conditions. Namely, for Android-Android, VQTDL results were higher than those of BRISQUE by 14%, Android-iOS by 3% and iOS-iOS by 8%. See the graph below.
We also tested how VQTDL and BRISQUE perform when there are more people on the call; in our case, we tested calls with two, four, and eight people. It is important to compare both solutions to subjective evaluation scores for different numbers of participants, because image quality and changes in image size can and do affect perceived quality. PLCC and average error scores on group calls for both VQTDL and BRISQUE are combined in the graph below.
VQTDL achieved higher PLCC results than BRISQUE by 12%, 35%, and 40% in calls with two, four, and eight people, respectively. Compared to BRISQUE, the average error scores for VQTDL were 19%, 8%, and 75% lower in calls with two, four, and eight people, respectively.
We also performed a stability test using a 3-minute test video with two people on the call under two network conditions (150KB and 2MB). For the stability test, we varied the position of the cropped area along with its size. The maximum change was 20% of the original image size (10% up or down).
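As a rough illustration of this kind of jitter, the sketch below shifts and scales a crop rectangle by up to 10% in each direction; the exact procedure used in our stability test may differ.

```python
import random

def jitter_crop(frame_width, frame_height, crop, max_fraction=0.10):
    """Shift a crop rectangle and scale its size by up to +/-10%,
    keeping it inside the frame. crop = (x, y, width, height)."""
    x, y, width, height = crop
    scale = 1.0 + random.uniform(-max_fraction, max_fraction)
    new_width = min(int(width * scale), frame_width)
    new_height = min(int(height * scale), frame_height)
    new_x = x + int(width * random.uniform(-max_fraction, max_fraction))
    new_y = y + int(height * random.uniform(-max_fraction, max_fraction))
    # Clamp the jittered crop so it stays within the frame.
    new_x = max(0, min(new_x, frame_width - new_width))
    new_y = max(0, min(new_y, frame_height - new_height))
    return new_x, new_y, new_width, new_height
```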
The first test was done on the 150KB video and the second on the 2MB video. Across all changes, the standard deviation for VQTDL was 33% lower than for BRISQUE in the first test and 14% lower in the second.
Areas of application—in which cases can you use VQTDL?
VQTDL was developed for video quality assessment use cases, such as real-time streaming and video conferencing with WebRTC products.
With this architecture we are able to process videos in an automated fashion by simply providing the video file to the server through a REST API call. Processing a 3-minute video typically takes around 30 seconds on a 4-CPU Intel machine with 8 GB of RAM. Results are returned as one MOS value per second of video; this granularity can be changed by decoding the video at a higher frame rate. The server can also return other video metrics, but that is outside the scope of this blog post.
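As a rough illustration, submitting a video could look like the sketch below; the endpoint URL and response field names are placeholders, not the actual VQTDL API.

```python
import requests

# Hypothetical endpoint and field names, used here for illustration only.
SERVER_URL = "http://vqtdl-server.example.com/api/v1/evaluate"

with open("test_call_recording.mp4", "rb") as video_file:
    response = requests.post(SERVER_URL, files={"video": video_file})

response.raise_for_status()
# Assumed response shape: one MOS value per second of video.
for second, mos in enumerate(response.json().get("mos_per_second", [])):
    print(f"{second}s: MOS {mos:.2f}")
```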
A newer and more efficient approach to video quality assessment
After carefully and meticulously developing this video quality assessment solution and comparing it to another no-reference solution, BRISQUE, we are confident in its ability to assess video quality under different conditions. Looking at the results, we can see that VQTDL can handle images with arbitrary resolutions, copes better with changes in the UI of the application under test, and produces prediction values that are more stable and closer to the subjective scores than BRISQUE's.
Another advantage of VQTDL is that no additional video processing is needed before passing a video in for evaluation, which is not the case with BRISQUE. Although it is possible to reach very high correlation and low error scores by fine-tuning BRISQUE for each scenario, doing so would make automated testing almost impossible, as different parameters would have to be set for the video layout of each application. BRISQUE is also very sensitive to UI changes.
Even though our tests demonstrated the efficiency of VQTDL with outstanding results, there is always room for improvement. Like any other solution based on deep learning, the VQTDL models learn from the data we provide. That is why we are building new datasets that cover most of the use cases occurring in video calls, so we can train models that perform even better in real-life scenarios.
Do you have a solution that relies on high video quality? See VQTDL in action. Contact us to find out more about our video quality assessment algorithm and how it can help you learn more about your video solution.