How many users are necessary to evaluate usability?


The question of how many users you need to get enough feedback on the quality of your interface comes up every time a testing process is set up. Do I need 5, 15, 30, 80, or thousands of users to improve my interface and deliver the quality of experience I have in mind for my users? This post attempts to answer that question.

Introduction

This question is still the subject of much debate today. But did you know that it is a relatively old topic, rooted in a story that began in the 1980s? Virzi and Nielsen were the first to investigate this concern, but their conclusions must be carefully contextualized. They are often misinterpreted, because one is tempted to set a single number as a golden rule. In this post we show that:

  • It is a false debate: 5 users, 15 users, 30 users, 80 users, thousands or even millions… this is one-upmanship, and things are much simpler;
  • A distinction must be made between evaluation contexts, so that each evaluation is performed in the appropriate way.

Each of these numbers makes sense, but not under the same conditions or for the same test objectives. To choose the right number of users, the context must be properly defined, and the questions you need to ask yourself are twofold:

  • What is the goal of my tests? What am I testing?
  • Is it for qualitative or for statistical purposes?

The first part of this post describes how these questions fit into an iterative design process within a user-centered approach.

The second part sheds light on the subsequent problem: specific tests are designed for each design stage, which raises the question of how big your user sample should be.

Finally, we test on a specific scenario the well-known precept: "you need 3-5 users to find 85% of the critical issues in your interface".

The user-centered design approach: an iterative process

When it comes to developing a new interface or designing a new feature or product, a four-stage iterative process should be set up:

Figure 1: The main steps of the interface design cycle.

At each stage, tests are carried out to:

  • assess the quality of the interface
  • evaluate the usability of the solution
  • quantify the volume of users
  • qualify the type of customers

Two points must be distinguished:

  • Qualitative evaluation: during stages 1, 2 and 3. Iterations should be run to remedy problems progressively but quickly and hand the results over to development.
  • Quantitative evaluation: at stage 4 and beyond (product release), once the solution is deployed and we want to run analytics or statistics on it.

It all boils down to the simple formula: Number of users = f(test nature, design stage)

A more elaborate answer is twofold:

‣ Before deployment = qualitative evaluation

At early design stages, before deployment, where qualitative evaluation is required, 3 to 5 users are enough:

"This answer has been the same since I started promoting discount usability engineering in 1989," claims Jakob Nielsen. The pioneering work of Virzi (Virzi, R. A. "Refining the Test Phase of Usability Evaluation: How Many Subjects Is Enough?" Human Factors. 1992;34(4):457-468. doi:10.1177/001872089203400407), the model by Nielsen and Landauer (1993, https://dl.acm.org/doi/10.1145/169059.169166), and the well-known Nielsen Norman Group posts that popularized it ("Why You Only Need to Test with 5 Users" and "How Many Test Users in a Usability Study?") all agree:

To assess the ergonomic quality of an interface at a given time, 3 to 5 users of the same typology (demographic criteria, profession, etc.) are sufficient.

Why?

  1. The probability of discovering new problems on the same use case decreases as the number of test users increases. In a word, every additional test user highlights the same errors. As a result, 80% to 85% of critical errors are found with 3 to 5 users.
  2. The design improvement process is iterative. Once we have this initial feedback, we apply the changes and retest. From the second round of tests onward, issues become rarer but deeper (problems with task-flow structures, etc.), because in the first round a defective visual element often prevented users from going further.
  3. Testing is expensive. Testing with fewer users but more often gives a better cost/benefit ratio. It is better to test 3 successively improved versions with 5 users each than a single version with 15 users: the insights will be constructive in the first case and highly redundant in the second.
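The "85% with 3 to 5 users" claim follows directly from the model Nielsen and Landauer fitted to usability studies: the share of problems found by n testers is 1 − (1 − L)^n, where L is the probability that a single tester reveals a given problem (about 31% on average across the projects they analyzed). A minimal sketch:

```python
# Nielsen & Landauer's model of problem discovery:
# expected share of usability problems found by n testers
# is 1 - (1 - L)^n, with L ~= 0.31 the average probability
# that a single tester reveals a given problem.

def problems_found(n_testers: int, l: float = 0.31) -> float:
    """Expected percentage of problems uncovered by n_testers."""
    return 100 * (1 - (1 - l) ** n_testers)

if __name__ == "__main__":
    for n in (1, 3, 5, 15):
        print(f"{n:2d} testers -> {problems_found(n):5.1f}% of problems")
```

With L = 0.31, five testers already uncover about 84% of the problems, while going from 5 to 15 testers only buys the last few percent, which is why iterating on small samples pays off.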

‣ After deployment or for specific tests

At release and post-release stages, quantitative evaluation is necessary and must be done on a sufficiently large user sample.

Two categories must be considered:

  1. Evaluation of user journeys against quantitative metrics such as journey time, journey errors, etc. To gather analytics and start computing statistics, you need at least 30 users.
  2. Specific quantitative analyses: eye tracking (around 40 people), A/B testing (at least around sixty).
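The "at least 30 users" threshold is the usual point at which the sample mean of a metric such as journey time is approximately normal (central limit theorem), so a confidence interval becomes meaningful. A sketch with hypothetical journey-time data:

```python
# Why ~30 users for quantitative metrics: with n >= 30 the sample mean
# is roughly normally distributed, so a 95% confidence interval can be
# estimated with the normal approximation. The journey times below are
# hypothetical (simulated), not measured data.
import math
import random

random.seed(42)
journey_times = [random.gauss(120, 25) for _ in range(30)]  # n = 30 users, seconds

n = len(journey_times)
mean = sum(journey_times) / n
var = sum((t - mean) ** 2 for t in journey_times) / (n - 1)  # sample variance
half_width = 1.96 * math.sqrt(var / n)  # 95% CI half-width, normal approximation

print(f"mean journey time: {mean:.1f}s +/- {half_width:.1f}s (95% CI)")
```

With fewer users the interval widens quickly (it scales as 1/sqrt(n)), which is why statistics on 5 testers are rarely worth reporting.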

Use-case: Value of a tester

The goal of this part is to evaluate how much each tester contributes when user-testing the usability of an interface.

Let’s buy a bicycle on Leboncoin!

The use case we chose was the following: go to Leboncoin’s website to find a bicycle for the scientific director in Rennes (UXvizer’s headquarters).

The 6 testers were aged between 21 and 36 and quite familiar with the website and how it works. Their user journeys were recorded so that they could be analyzed according to:

  • some usability criteria (waiting time, transition types, scroll speed, wipes and patterns),
  • an accessibility criterion (contrast ratio of the textual parts), and
  • visual criteria (colors, empty zones caused by loading time).

User Journey type

The main representative user journey – across the 6 testers – consists of the following 12 screens:

Figure 2: Main screens of user journeys of 6 testers.

Main attention points

By aggregating the tests of the 6 users, 13 attention points were found in Leboncoin’s interface:

  • Contrast ratio: some words – especially the location – are in light gray, which implies low accessibility for these words.
  • Colorimetry: the color palette on screen can be erratic. Admittedly, this can happen when photos are present.
  • Empty areas: because of the images loading process, holes appear in the middle of screens.
  • Patterns: linked to the search most of the time.
  • Inhomogeneous page transitions: wipes, fade-ins and fade-outs are all mixed together, which leads to a lack of clarity. Note that this problem is solved in the app, where the logo page dissolves into the new context on a context change.
  • Transition time: one user had to spend 16% of their time on transitions (weak connection signal).
  • Bug patterns in the loading process (a repetitive pattern alternating loading ‣ page loaded ‣ loading ‣ page loaded).
  • Waiting time (transitions + loading time) is significant: on average, users wait for 10% of their journey.
  • 30 to 40% of the time is spent specifying the location and finding answers to the request.
  • 10 steps are needed on average before reaching the answer to the first request (a women’s bicycle in Rennes).
  • Many small context changes, mainly associated with the loading process.
  • One user scrolls a lot, and quite fast.

Testing approach: little is better than nothing

Now, here is the protocol we followed to check the rule: “3-5 users are enough to find 85% of critical issues.”

Starting from an empty set of errors, one user is picked at random among the remaining testers and the additional errors they found are added to the previous ones. All possible orderings of the users were executed (720 combinations in total) and an average curve was computed. The bar plot in Figure 3 below shows the number of errors found with each additional tester.

Figure 3: Evolution of the number of found errors at each additional tester.

What we can observe is that:

  1. A single tester provides a lot of value in usability validation. The gain between having no user test the interface and having at least one is huge: most of the critical errors (60%) are found with one user.
  2. Additional testers are beneficial, since more errors are found, but
  3. the gain from adding a new tester decreases with the number of users who have already tested the interface.
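The permutation-averaging protocol can be sketched as follows. The per-tester issue sets below are hypothetical, shaped so that one tester finds about 60% of the 13 issues, as observed above:

```python
# Sketch of the permutation-averaging protocol. Each tester is
# represented by the (hypothetical) set of issue IDs they uncovered;
# for every ordering of testers we accumulate the total number of
# distinct issues seen so far, then average over all 6! = 720 orderings.
from itertools import permutations

# Hypothetical data: issues found by each of the 6 testers (13 issues, IDs 0..12).
testers = [
    {0, 1, 2, 3, 4, 5, 6, 7},
    {0, 1, 2, 3, 4, 5, 6, 8},
    {0, 1, 2, 3, 4, 5, 7, 9},
    {0, 1, 2, 3, 4, 6, 8, 10},
    {0, 1, 2, 3, 5, 7, 9, 11},
    {0, 1, 2, 4, 6, 8, 10, 12},
]

def average_cumulative_errors(testers):
    n = len(testers)
    totals = [0.0] * n  # totals[k]: sum over orderings of #issues after k+1 testers
    count = 0
    for order in permutations(range(n)):  # all 720 orderings for n = 6
        seen = set()
        for k, idx in enumerate(order):
            seen |= testers[idx]       # accumulate the new issues this tester adds
            totals[k] += len(seen)
        count += 1
    return [t / count for t in totals]

print(average_cumulative_errors(testers))
```

The resulting curve is necessarily non-decreasing and flattens out, which is exactly the diminishing-returns shape of Figure 3.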

What to keep in mind?

For qualitative evaluation of your interface during the mockup, prototyping or development phase:

  1. One tester is better than none.
  2. You don’t need a lot of users: 2-5 users are enough. But test your interface often and iteratively. This saves time and money.
  3. Testing is an iterative process, so even if all issues are not found, at least the most critical ones will be found. If there are remaining ones, they will be found in another user testing round.