In an age where almost everything can be measured, calculated, and evaluated objectively, subjective listening tests still hold their ground and remain a crucial part of audio quality testing. While a plain listening test with a small batch of random participants does not guarantee that the results reflect how the general population would judge the audio quality, a consistent test environment and standardized evaluation methods greatly improve objectivity and deliver results that even the toughest critics can agree upon. One such method, defined in recommendation BS.1534-3 by the ITU-R (International Telecommunication Union, Radiocommunication Sector), is the MUSHRA methodology.
What is MUSHRA?
MUSHRA is an acronym that stands for Multiple Stimuli with Hidden Reference and Anchor. It is a subjective listening test methodology that is most commonly used to evaluate the perceived sound quality for systems in various technology domains. Examples of such systems are audio codecs, headphones, speakers, apps, and software that are capable of media playback.
MUSHRA was first introduced to the world in recommendation ITU-R BS.1534-0 in late June of 2001 and has since become one of the most revised and thoroughly documented subjective audio quality evaluation methodologies. Currently, its third revision, ITU-R BS.1534-3, is in force (approved in October 2015).
When is it appropriate to use this testing methodology?
Now that we have accepted the importance of subjective listening tests and know that there is a methodology to follow, we are ready to dive into conducting the tests, right? Unfortunately, no…
The often disappointing reality is that MUSHRA is not the only subjective audio evaluation method specified by ITU-R, nor by other organizations. In fact, readers who have looked into the vast catalog of recommendations issued by ITU-R will know that there are at least a handful of subjective audio evaluation methodologies that seem identical at first glance yet differ in subtle ways depending on various factors. Some of these methodologies, and their differences, are given a quick look further on.
So, when is it appropriate to use MUSHRA tests?
As stated in the recommendation itself: “MUSHRA is to be used for assessment of intermediate quality audio systems” (ITU-R BS.1534-3, p.3). To better understand what is considered an “intermediate audio quality system”, yet another quote from the recommendation can be used: “...that new kinds of delivery services such as streaming audio on the Internet or solid-state players, digital satellite services, digital short and medium wave systems or mobile multimedia applications may operate at intermediate audio quality;” (ITU-R BS.1534-3, p.1).
Now, how can we determine if the system produces small, intermediate, or significant audio quality impairments?
Generally, the rule of thumb is: if the audio quality issues do not affect the listening experience of the user and are mostly perceivable only by trained experts, it is a small impairment. If the audio quality issues completely ruin the listening experience and make the audible content hard to understand even for an untrained person, it is a significant impairment. If the system's audio quality behavior falls somewhere in between, it is an intermediate audio quality impairment.
To add more context: nine times out of ten, audio issues considered small impairments are caused by the acoustic properties or physical limitations of the system under test and generally do not impact the listening experience.
An illustrative example of such a system might be a speaker that produces a certain undertone or slight distortion when playing at a high volume level, or noise-canceling headphones that let through the smallest amount of background noise during playback without affecting the quality of the playback itself. In such cases, the ITU-R BS.1116-3 recommendation could be followed instead.
However, if it is expected that the system under test might produce audible defects and a noticeable drop in quality that affect the listening experience, the ITU-R BS.1534-3 recommendation must be followed and MUSHRA tests should be applied.
An illustrative example of systems with intermediate audio impairments could be a video conferencing service that produces audio artifacts such as pops, glitches, and occasional dropouts during conversations, or an audio streaming service that limits the frequency bandwidth during playback according to network conditions.
It's important to note that these examples are for illustrative purposes, and the categorization of impairments may vary depending on the specific context and the degree of impact they have on audio quality - each case should be approached individually.
Process of a MUSHRA test
Listening panel
Firstly, a group of participants needs to be established - ITU-R recommends that a listening panel of 15 to 20 experienced listeners should be sufficient. Listeners should go through a pre-screening process according to the ITU-R BS.2300-0 recommendation. Essentially, this recommendation is used as a checklist to make sure that listeners are experienced, reliable, and competent enough to detect audio quality issues and take part in the evaluation process. The listening panel then takes part in a pilot test with the goal of getting acquainted with the range of test objects and audio artifacts. Only the scores of experienced listeners (according to the aforementioned recommendation) are included in the final data analysis.
Test material
A single MUSHRA listening test trial consists of:
- A reference signal - the original test object in its purest form, usually marked and separated from other stimuli [stimuli - something that causes a reaction in participants; in this case, the sounds intended for evaluation].
- A hidden reference signal - the same reference signal, but anonymized and randomly mixed within the rest of the stimuli.
- At least two “anchor” signals - the first (low) anchor is a low-pass filtered version of the original signal with a cut-off frequency of 3.5 kHz, and the second (mid-range) anchor has a cut-off frequency of 7 kHz. Anchor signals must also be anonymized and mixed in with the other stimuli. For scenario-specific cases, more anchor signals can be used - refer to recommendations ITU-T G.711, G.712, G.722, and J.21 for more details. (A minimal sketch of generating the two standard anchors follows this list.)
- Up to 9 signals under test - i.e. test objects. If more than 9 signals are intended for evaluation, a blocked test design, where similar test objects are grouped in multiple blocks, may be required.
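To make the anchor definition concrete, here is a minimal Python sketch of generating the low and mid anchors from a reference recording. The file names are placeholders, and the 8th-order Butterworth low-pass filter is our assumption - BS.1534-3 specifies the cut-off frequencies, not the filter design:

```python
# Minimal sketch: derive the 3.5 kHz and 7 kHz anchors from a mono
# reference WAV. File names and filter order are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def make_anchor(signal: np.ndarray, sample_rate: int, cutoff_hz: float) -> np.ndarray:
    """Low-pass filter the reference signal to create an anchor."""
    sos = butter(8, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, signal, axis=0)

rate, reference = wavfile.read("reference.wav")
reference = reference.astype(np.float64)

low_anchor = make_anchor(reference, rate, 3500.0)   # low anchor, 3.5 kHz
mid_anchor = make_anchor(reference, rate, 7000.0)   # mid anchor, 7 kHz

wavfile.write("low_anchor.wav", rate, low_anchor.astype(np.int16))
wavfile.write("mid_anchor.wav", rate, mid_anchor.astype(np.int16))
```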
In order to avoid fatiguing the listeners, ITU-R recommends that the test material be 10 to 12 seconds long. The material should be neutral - it shouldn't distract the listener from the test - yet deliberately chosen to put the sound reproduction equipment under stress. The best approach is to use something similar to what listeners are accustomed to hearing every day - music, speech, homogeneous nature sounds, etc.
Test material is presented in trials. The number of trials is dependent on the number of test materials to be evaluated.
Let's say you want to analyze and compare the quality of 5 different audio encodings. To follow the best practices, you would probably have more than 1 test sample - for this example, let's say you want to evaluate them with 4 different songs. The first test trial would have song #1 in 5 conditions (the encodings) alongside the reference and anchors. The second test trial would have song #2 in 5 conditions, and so on and so forth until all 4 songs in all the different encodings are evaluated.
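Assembled in code, the trial structure for this example might look like the following sketch - the file-naming scheme and shuffling logic are hypothetical, but the composition of each trial (conditions plus hidden reference and anchors, presented in random order) follows the recommendation:

```python
# Hypothetical sketch of assembling the trials for the example above:
# 4 songs x (5 encodings + hidden reference + 2 anchors) per trial.
import random

songs = ["song1", "song2", "song3", "song4"]
encodings = ["codec_a", "codec_b", "codec_c", "codec_d", "codec_e"]

trials = []
for song in songs:
    stimuli = [f"{song}_{enc}.wav" for enc in encodings]
    stimuli += [
        f"{song}_reference.wav",   # hidden reference
        f"{song}_anchor_3k5.wav",  # low anchor
        f"{song}_anchor_7k.wav",   # mid anchor
    ]
    random.shuffle(stimuli)  # anonymize the graded stimuli
    trials.append({"open_reference": f"{song}_reference.wav", "graded": stimuli})

random.shuffle(trials)  # randomize trial order for each listener
```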
Keep in mind that, although variation in sound reproduction states (mono, stereo, etc.) is encouraged, each reproduction state should be presented separately.
Grading logic + tool
The listeners evaluate all stimuli on the continuous quality scale (CQS) - a vertical scale ranging from 0 to 100 that is divided into five equal intervals (from the bottom: bad, poor, fair, good, excellent). The objective is to listen to the stimuli and assign each a MUSHRA score. The samples can be played multiple times, and A/B listening (comparing against the reference sample or between stimuli) is allowed.
Since this is mostly a visual process, grading software is commonly used. Ideally, the grading software should have a graphical user interface consisting of an array of stimuli, where each sample has a button to select it and a vertical slider used for grading; the MUSHRA score is displayed above the slider. The GUI also has play/pause buttons and, occasionally, a loop function with which a specific segment of the sample can be highlighted and played.
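As a small illustration of the scale itself, here is a hypothetical helper (not part of any standard tooling) that maps a CQS score onto its interval label:

```python
def cqs_label(score: float) -> str:
    """Map a 0-100 continuous quality scale score to its CQS interval."""
    labels = ["bad", "poor", "fair", "good", "excellent"]
    # Five equal 20-point intervals; clamp 100 into the top interval.
    return labels[min(int(score // 20), 4)]

assert cqs_label(95.0) == "excellent"
assert cqs_label(35.0) == "poor"
```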
Grading criteria
The listener rates the stimuli based on various sound attributes and characteristics. Each sound reproduction state has its own set of sound attributes; however, all of them share one attribute that should always be taken into account by default - basic audio quality. According to ITU-R BS.1284: “It includes, but is not restricted to, such things as timbre, transparency, stereophonic imaging, spatial presentation, reverberance, echoes, harmonic distortions, quantization noise, pops, clicks, and background noise.” (ITU-R BS.1284, p.4)
Aside from the basic audio quality attribute, here is a detailed list of the grading criteria for different sound reproduction state systems (see ITU-R BS.1534-3):
- Monophonic system. Basic audio quality: This single, global attribute is used to judge any and all detected differences between the reference and the test object.
- Stereophonic system. Basic audio quality and, additionally,
- Stereophonic image quality: This attribute is related to differences between the reference and the test object in terms of sound image locations and sensations of depth and reality of the audio event.
- Multichannel system. Basic audio quality and, additionally,
- Front image quality: This attribute is related to the localization of the frontal sound sources. It includes stereophonic image quality and losses of definition.
- Impression of surround quality: This attribute is related to spatial impression, ambiance, or special directional surround effects.
- Advanced sound system. Basic audio quality and, additionally,
- Timbral quality:
- The first set of timbral properties is related to the sound color, e.g. brightness, tone color, coloration, clarity, hardness, equalization, or richness.
- The second set of timbral properties is related to sound homogeneity, e.g. stability, sharpness, realism, fidelity, and dynamics.
- Localization quality: This attribute is related to the localization of all directional sound sources. It includes stereophonic image quality and losses of definition. This attribute can be separated into horizontal localization quality, vertical localization quality, and distant localization quality. In the case of a test with an accompanying picture, these attributes can also be separated into localization quality on the display and localization quality around the listener.
- Environment quality: This extends the attribute of surround quality. This attribute is related to spatial impression, envelopment, ambiance, diffusivity, or spatial directional surround effects. This attribute can be separated into horizontal environment quality, vertical environment quality, and distant environment quality.
Analysis
Although the listening sessions themselves are the exciting part, arguably the more important phase of MUSHRA tests is the analysis. With all the data gathered from the listening trials, it is necessary to apply statistical analysis methods to get the best sense of the tendencies and form valid conclusions.
While it is generally advised to adopt specific analysis methods for specific scenarios, without going into too much detail about each method, here is a list of the most commonly used analysis methods:
- Assessor post-screening. This is less an analysis method and more a mandatory activity of a MUSHRA test that contributes to data analysis. It essentially limits the number of data outliers by:
- Excluding from the aggregated responses the assessors who rated the hidden reference below 90 for more than 15% of the test items.
- Excluding from the aggregated responses the assessors who rated the mid-range anchor above 90 for more than 15% of the test items.
- Assessor analysis with eGauge. This is an analysis model that evaluates the reliability and discrimination of an assessor as well as the agreement rating between a listener and the rest of the panel.
- Raw data visualization. This is the initial step of the test data analysis and could be categorized as an exploratory analysis method. It may incorporate the use of histograms with a fitted normal distribution curve, box plots, or quantile-quantile (Q-Q) plots. The goal of this method is to check for obvious data outliers and to verify overall data quality.
- Analysis of variance (ANOVA). In essence, the purpose of ANOVA is to compare the mean values of test conditions and determine the significant differences. Although the steps may vary when choosing different variations of the ANOVA model, it usually involves calculating mean opinion scores (MOS), formulating the null and alternative hypotheses, determining the threshold for statistical significance, and calculating the F-statistic and p-value. For a detailed step checklist, please refer to chapter 9.3, page 16 of ITU-R BS.1534-3 recommendation.
- Post-hoc tests. These tests are a follow-up activity to the ANOVA analysis in cases where the dependent factor is significant in ANOVA. Generally, these tests are used to gain additional insight into specific pairwise differences between the test conditions. There may also be cases where no post-hoc tests are needed. (A combined sketch of post-screening, ANOVA, and a post-hoc test follows this list.)
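To ground the steps above, here is a minimal Python sketch that applies the two post-screening rules, runs a one-way ANOVA across three hypothetical test conditions, and follows up with a Tukey HSD post-hoc test. The data layout, sample sizes, and score distributions are illustrative assumptions, not prescriptions from the recommendation:

```python
# Illustrative sketch: post-screening, ANOVA, and a post-hoc test.
import numpy as np
from scipy import stats

def keep_assessor(hidden_ref_scores, mid_anchor_scores) -> bool:
    """Post-screening: drop an assessor who rated the hidden reference
    below 90, or the mid-range anchor above 90, on more than 15% of items."""
    ref_fail = np.mean(np.asarray(hidden_ref_scores) < 90) > 0.15
    anchor_fail = np.mean(np.asarray(mid_anchor_scores) > 90) > 0.15
    return not (ref_fail or anchor_fail)

# Example: an assessor who rated the hidden reference 95 on all items
# and the mid anchor around 40 passes the screening.
assert keep_assessor([95] * 8, [42, 38, 45, 40, 41, 39, 44, 37])

# Fabricated example scores: 16 retained listeners x 8 items per condition.
rng = np.random.default_rng(0)
codec_a = rng.normal(78, 8, size=16 * 8)
codec_b = rng.normal(74, 8, size=16 * 8)
codec_c = rng.normal(60, 8, size=16 * 8)

# One-way ANOVA: do the condition means differ significantly?
f_stat, p_value = stats.f_oneway(codec_a, codec_b, codec_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

# Pairwise post-hoc comparison (Tukey HSD) when ANOVA is significant.
if p_value < 0.05:
    print(stats.tukey_hsd(codec_a, codec_b, codec_c))
```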
Keep in mind that certain aspects of statistical analysis might appear daunting to someone with little experience in the field. Furthermore, performing manual calculations might not always be practical even for experienced people; therefore, given how many software solutions for statistical analysis are available (across all price ranges), it is a good idea to offload as much of the analysis process as possible to these tools.
Deliverables
Conducting tests is pointless if nothing tangible is gained from them; therefore, recommendation ITU-R BS.1534-3 provides quite a comprehensive description of what to include in the report, how to interpret the results, and what should be emphasized.
First and foremost, the report must convey the rationale for the study, the methods used, and the conclusions drawn. As such, it is highly recommended to prepare the descriptions of the test objective, test materials, and test design prior to the evaluation process - how well the test is defined ultimately dictates the quality and reproducibility of the test results.
The report must also include any deviations from the standard procedure, along with an explanation of all countermeasures taken to compensate for them.
Secondly, all applicable analysis methods must be followed up with the correct visualization of data. Each analysis method has its own visual representation logic, but most often, the data will be visualized with column charts and various plots. It is important to present the visualizations in a user-friendly way so that, regardless of the reader's experience, it's easy to get the relevant information.
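As an example of the kind of visualization most MUSHRA reports lead with, here is a short matplotlib sketch that plots the mean score per condition with 95% confidence intervals. The condition names and score arrays are fabricated placeholders:

```python
# Illustrative sketch: mean MUSHRA scores with 95% confidence intervals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
raw = {  # placeholder data: pooled scores per condition
    "Hidden ref": rng.normal(97, 3, 128),
    "Codec A": rng.normal(78, 8, 128),
    "Codec B": rng.normal(74, 8, 128),
    "Low anchor": rng.normal(25, 7, 128),
}
conditions = {name: np.clip(s, 0, 100) for name, s in raw.items()}

means = [s.mean() for s in conditions.values()]
# Half-width of the 95% confidence interval of each condition mean.
ci = [stats.sem(s) * stats.t.ppf(0.975, len(s) - 1) for s in conditions.values()]

plt.errorbar(list(conditions), means, yerr=ci, fmt="o", capsize=4)
plt.ylim(0, 100)
plt.ylabel("MUSHRA score")
plt.title("Mean scores with 95% confidence intervals")
plt.show()
```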
After that, it is good to compose an interpretation of the data - noting down all the observations, affirming or rejecting the hypotheses, forming an opinion, or giving a recommendation - essentially, concluding the test. It is best to present this section from as many perspectives as possible - a naive and/or expert listener's perspective, or a cost-saving/practicality standpoint - while also making sure to educate the reader about any risks associated with the outcomes of the test.
Lastly, a summary of the test environment is given. It consists of information about the equipment and software used for the tests, as well as a specification of the listening premises - whether the room complies with any standards relevant to the test, along with attributes such as area, window/door/furniture count, and wall or insulation materials. Besides that, it is also advisable to describe the acoustic properties of the environment - background noise level, reverberation time, etc.
Similar test methods
Now that we are, hopefully, well informed about the MUSHRA test methodology, let's take a small glimpse at similar test methods for the sake of curiosity! The following is a short list of analogous testing methods:
- ITU-R BS.1116-3 - The methodology outlined in this recommendation addresses the evaluation of systems with small audio impairments. The overall process is similar to that of the MUSHRA methodology; however, there are differences in the evaluation logic - the tests utilize a different grading scale, and the listeners are presented with only 3 stimuli. This recommendation also has stricter guidelines on the permissible test environment and other factors, which is why it is used as a base for other methodologies.
- ITU-R BS.2132-0 - The methodology outlined in this recommendation mirrors many aspects of the MUSHRA method but excludes the presentation of reference and anchor signals. This method can be used in cases where it is not important or not possible to compare stimuli to a reference. In addition to overall subjective sound quality, the listeners also rate the severity of pre-determined sound attributes.
- ITU-R BS.2126-1 - The methodology outlined in this recommendation addresses the evaluation of sound systems with an accompanying picture. This method goes hand in hand with the ITU-R BS.1116-3 recommendation, yet the main emphasis is put on assessing the spatial and temporal properties of a sound when it is combined with video.
The bottom line
Taking everything into account, the ITU-R BS.1534-3 recommendation outlines a robust method for approaching subjective audio evaluation.
Notice how some steps in this blog post were described rather loosely - the truth is that the MUSHRA test methodology is relatively flexible. Everything depends on the objective of the tests.
- Do you wish to have a larger listening panel than recommended? - You are more than welcome!
- Do you wish to tailor the listening conditions for a specific context? - Certainly doable!
- Do you wish to apply non-parametric or any other analysis method to the data? - You can!
As long as the rest of the base principles are followed, and all deviations from the standard procedure are documented and presented in a way that would, hypothetically, let someone on the other end of the globe reproduce the test results, the approach is absolutely valid.
There is no doubt that objective audio tests are important - and, moreover, proven to be efficient and effective - but at the end of the day, the thing we are trying so hard to perfect is still going to be perceived by humans. Knowing that, it only makes sense that the desired object goes through a couple of pairs of trained ears before it reaches a wider audience. That said, both subjective and objective audio tests have their place and should be used in synergy to achieve the best audio quality.
Key takeaways
- Subjective listening tests, like MUSHRA, remain essential in audio quality evaluation because they capture nuances that objective methods often miss. By following standardized methods and maintaining a controlled environment, subjective tests can yield reliable, reproducible results - even for critical audiences.
- MUSHRA is specifically designed for systems with intermediate audio quality impairments, such as audio streaming services or video conferencing tools that exhibit occasional glitches or bandwidth limitations. It bridges the gap between slight and significant impairments, making it an ideal choice for evaluating everyday-use technologies.
- The MUSHRA methodology is both structured and flexible, allowing testers to adapt elements like the panel size, test design, and analysis methods to fit their objectives. The key lies in transparency: documenting deviations from the standard ensures reproducibility and credibility, while fostering a balance between subjective and objective evaluations for comprehensive audio testing.
Do you have a system that would benefit from subjective audio evaluation, or are you looking to test your system before it reaches a wider audience? Contact us to learn how our audio and video quality testing services can help you ensure that your solution is at the top of its game!