Rule-based UX Evaluations – undervalued methods

TL;DR: Rule-based evaluations for usability are an efficient way to improve interfaces

There are a lot of designs that could be improved. The "right" way to do this is to test with users, applying the appropriate method to the appropriate questions. After all, user experience design is an empirically based discipline; so it is claimed and framed that UX without users is no UX [2].

However, if this is the (only) way to go, it is no wonder that usability evaluations are rarely done. Research with users is a costly activity: you need to find testers from the same population as your assumed users, the testers should be paid adequately for their time, and you need someone trained to run the tests and to analyze them.

I am not against empirical user research; it is an important part of my job. But while getting professionals and users might not be a big worry for the Autodesks and Googles of this world, a lot of small companies and community-run open source projects do not have the resources. Some can barely pay a few developers. And I freely admit that I, as a UX research professional, do not test my personal open source projects in formal empirical user tests, either.

Among the methods for improving UX are some that do not test with users, but allow finding possible problems on a budget and in a way that is open to scrutiny: rule-based evaluation methods.

Rule-based evaluation methods prescribe a process for evaluating the interface and for creating a summarizing representation of the findings.

Applying rule-based evaluation methods requires some experience or training, but they are easier to learn than empirical testing, which often builds on half a psychology major's worth of learning. Also, no participants will be hurt if rule-based methods are applied wrongly.

Rule-based evaluation methods work differently and find partly different problems than user tests [8]. In a comparison of interface evaluation methods [6], heuristic evaluation identified the most problems at a rather low cost, whereas empirical usability testing did well at finding the serious problems, but at a high cost and while missing problems caused by inconsistency in the interface [3].

Here is a brief overview of the methods:

Heuristic Evaluation

Heuristic evaluation is a classic and has been around since at least 1990. Several UX experts use a list of design principles to find possible problems and pool their findings. The best cost/benefit ratio comes from doing the evaluation with 3–5 evaluators. If you have resources for more, invest in complementary methods [7].

I usually use the heuristics by Nielsen [1]. One can also create custom lists for specific domains [4].

To apply the heuristics, you need to define a task that a user would do, so you can look for problems that the user meets while doing this task. The task is usually chosen because it is done frequently when using the software and/or is crucial for its use.

Then every evaluator does the task while consulting the list of heuristics to identify problems and their causes. Just having the list does not ensure good application: it is a tool rather than an algorithm. Finding a problem can start from the list (like: "Considering the 'match the real world' rule, I note that this is not phrased in the user's language") or evaluators find the problem first and use a heuristic to explain it (like: "I got stuck here… hmm, this can be explained with the 'consistency' rule, since the interaction needed to proceed is non-standard").

After going through the tasks, the evaluators pool their observations, since no single evaluator can find all (or even a large fraction) of the problems. The result is a pooled list of issues and the rule violations they might be caused by.
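To make the pooling step concrete, here is a minimal sketch in Python of how pooled findings could be collected; the issues, heuristic names, and evaluator labels are made-up examples, not part of any published procedure:

```python
from collections import defaultdict

# Each evaluator's findings: (issue, violated heuristic).
# All entries below are hypothetical examples.
evaluator_findings = {
    "evaluator_a": [
        ("Save button is labeled 'Commit'", "Match between system and the real world"),
        ("Dialog can only be closed via keyboard", "User control and freedom"),
    ],
    "evaluator_b": [
        ("Save button is labeled 'Commit'", "Match between system and the real world"),
        ("Error message shows a raw error code", "Help users recognize and recover from errors"),
    ],
}

# Pool the findings: group by issue, collect the heuristics used to
# explain it, and note how many evaluators found it.
pooled = defaultdict(lambda: {"heuristics": set(), "found_by": set()})
for evaluator, findings in evaluator_findings.items():
    for issue, heuristic in findings:
        pooled[issue]["heuristics"].add(heuristic)
        pooled[issue]["found_by"].add(evaluator)

for issue, info in pooled.items():
    print(f"{issue}: {', '.join(sorted(info['heuristics']))} "
          f"(found by {len(info['found_by'])} of {len(evaluator_findings)} evaluators)")
```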

UI Guidelines

  • People: At least one person; either a UX professional or a developer can do it.
  • Materials: The UI guidelines and an interface to evaluate
  • Outcome: List of violations and possibly an alternative pattern from the guidelines for each of them.
  • More: Guidelines for Windows, Google products, Gnome Desktop

UI guidelines are often part of operating system ecosystems. They summarize how software should look and behave.

You will find sections on what applications should be like, how to think about their users, and how to approach decisions for or against features, more on a product and architecture level. Guidelines also describe how a single repeated component, like a button, should behave, what states it should have, and particularly when you can and cannot use it, as well as how such elements can be combined in dialog windows or toolbars. This makes UI guidelines very easy to use in development. They can be very prescriptive, but that is the point: they standardize, so that users need to learn things only once. Being prescriptive also makes them relatively easy to use for non-experts.

If you design an application or extension for a platform, see if it has such guidelines and try to follow them.

Similar to UI guidelines are design systems and pattern libraries. In many cases, however, they are less like a textbook and more like a collection of UI elements, and thus do not document the reasons for design decisions or how the elements should be used. They are still useful, and if a design system gets used a lot, it might be worth supplementing it with instructions similar to the guidelines linked above.

PURE

  • People: 2 or more UX Experts
  • Materials: Persona and their top tasks
  • Outcome: Overall score + worst problem indication
  • More: PURE on measuringu.com

I must admit that I have yet to use the PURE method. PURE stands for "Practical Usability Rating by Experts". It is similar to heuristic evaluation (but does not use heuristics) and produces a rating score. You use a persona's top tasks to define which user tasks to evaluate. Then you decompose the tasks into the steps the user has to take; if in doubt about what the path is, look at previous tests or, if available, web analytics data. Then, for each of the steps, estimate the cognitive load for the persona from 1 (easy, little cognitive load) to 3 (very hard, a lot of cognitive load). Pool the scores of the reviewers, and where evaluators have given different scores, try to resolve the disagreements by explaining your reasoning.

The result is a score for the whole task, calculated by adding the agreed-on cognitive load ratings of each step, plus a color from 3/red over 2/yellow to 1/green signifying the worst score among all steps of the task. So if your scores for the 4 steps of a task were 1-1-3-2, the final task color would be red, because step 3 was rated 3/red, the worst rating in the set.
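The scoring arithmetic is simple enough to sketch in a few lines of Python; this uses the 1-1-3-2 example from above, and the function name and structure are mine, not part of the published method:

```python
# Agreed-on cognitive load ratings per step: 1 = easy ... 3 = very hard.
COLORS = {1: "green", 2: "yellow", 3: "red"}

def pure_task_score(step_ratings):
    """Return (total score, task color) for one task.

    The total is the sum of all step ratings; the task color reflects
    the worst (highest) rating that occurs in any single step.
    """
    return sum(step_ratings), COLORS[max(step_ratings)]

# The example from the text: steps rated 1-1-3-2.
total, color = pure_task_score([1, 1, 3, 2])
print(total, color)  # 7 red
```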

PURE seems to be a bit more structured than heuristic evaluation, and I assume it needs a bit more expertise to estimate "cognitive load" (which is an actual concept from learning psychology). The resulting score probably makes it useful for working with people who like quantification and comparing results.

KLM

KLM, the Keystroke-Level Model, does not find usability problems; it quantifies efficiency. As in PURE or heuristic evaluation, you determine a task you want to apply the method to. Then you decompose it into input and mental actions, so-called "operators": needed keystrokes, mouse clicks, and time to think (thinking being the tricky part). Based on rules, you simplify the chain of needed operators and arrive at a final result: the number of seconds an expert would need to accomplish the task. KLM is thus useful to show that one change is likely more efficient than another, to show the benefits of keyboard accelerators, or to get some common ground when users remark "But your new function is much less efficient than…".
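As a sketch of how such an estimate comes together, here is a minimal Python version that sums commonly cited operator times from the KLM literature; the task decomposition is a hypothetical example, and real applications of KLM use additional operators and placement rules for the mental operator:

```python
# Commonly cited KLM operator times (seconds); treat them as rough defaults.
OPERATOR_TIMES = {
    "K": 0.28,  # keystroke or mouse button press (average typist)
    "P": 1.10,  # pointing at a target with the mouse
    "H": 0.40,  # homing the hand between keyboard and mouse
    "M": 1.35,  # mental preparation ("time to think")
}

def klm_estimate(operators):
    """Sum the standard times for a sequence of KLM operators."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Hypothetical comparison: picking a menu entry with the mouse
# (think, reach for mouse, point at menu, click, point at entry, click)...
mouse_version = ["M", "H", "P", "K", "P", "K"]
# ...versus a keyboard accelerator (recall the shortcut, press two keys).
shortcut_version = ["M", "K", "K"]

print(f"Mouse:    {klm_estimate(mouse_version):.2f} s")     # 4.51 s
print(f"Shortcut: {klm_estimate(shortcut_version):.2f} s")  # 1.91 s
```

Such numbers will not match a stopwatch exactly, but the relative comparison is usually what matters in discussions about efficiency.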

A problem remains…

If there is no trust in expert judgments, even when the above methods make them accountable, it is hard to use these methods. Usability testing can "hide" its expert involvement a bit by presenting the "real users" as the source of the data, but the test setup, moderation etc. have been done by experts, too.

If you work in a context where these methods are accepted, they can supplement usability tests if you do tests already, and they provide a viable alternative where empirical tests cannot currently be done.


Notes

  • 2023-02-03: One reason for the rare application of rule-based methods might be that their use is a skill that is hard to claim and defend, in contrast to the respective alternative methods. The alternative to heuristic analysis is the empirical usability test; tests are easier to defend as the exclusive territory of UX (research) experts, and non-experts using them wrongly can easily be attacked for the biases or inefficiencies they introduce: there is a metaphorical catalog of things that can be done wrong in usability tests (partly building on the academic methods of psychology), and UX researchers know that catalog fairly well. This enables them to say that others should not do the tests, make the mistakes, and thus waste time and money and cause wrong results; research professionals should do the testing instead and be paid for it.
    In contrast to empirical tests, heuristic analyses are relatively cheap, you can do no harm to participants using them, and while you can also do things wrong there, there is no established catalog of mistakes to make. These seem like advantages, but they are disadvantages for the task-claims of the UX professions. Thus, while rule-based methods originate in usability research, UX professionals might not have much to gain by advocating for them, but do gain professional power from empirical tests with users. This also provides an explanation for "UX without users [i.e. empirical research with users] is not UX", even though empirical evaluations show that yes, you can improve UX with rule-based methods, i.e. "without users". Interesting literature in this context is Abbott's "The System of Professions" (which does not discuss IT, but general mechanisms of professional conflicts).
Rule-based UX evaluations by Jan Dittrich is licensed under a Creative Commons Attribution 4.0 International License.

  1. Nielsen, Jakob. 1994. "10 Heuristics for User Interface Design." April 24, 1994. http://www.useit.com/papers/heuristic/heuristic_list.html

  2. This is not new: "[Rule-based evaluation] has been seen as inferior by most researchers", claim Nielsen and Molich in their 1990 article [7].

  3. Consistency means that UI elements and interactions look and work according to the same principles in different parts of the product. It is no wonder that one-time testers do not find these issues, since they do not know many different parts of the product. Consistency is very valuable and pays off as soon as real users use the whole product and not just some functionality, enabling them to predict how the product will behave.

  4. Which I have never done so far. It can be done by reviewing competitor products, doing user testing, and deriving principles from this. I suppose this is a lot of work, which might be worth it if you work in a specific domain and do these analyses more often. The process seems to be described in [5], but I could not find the thesis online.

  5. Dykstra, D. J. 1993. A Comparison of Heuristic Evaluation and Usability Testing: The Efficacy of a Domain-Specific Heuristic Checklist. Ph.D. diss., Department of Industrial Engineering, Texas A&M University, College Station, TX. 

  6. Jeffries, Robin, James R. Miller, Cathleen Wharton, and Kathy Uyeda. 1991. "User Interface Evaluation in the Real World: A Comparison of Four Techniques". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 119–124. CHI '91. New York, NY, USA: ACM. https://doi.org/10.1145/108844.108862

  7. Nielsen, Jakob, and Rolf Molich. 1990. "Heuristic Evaluation of User Interfaces". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 249–256. CHI '90. New York, NY, USA: ACM. https://doi.org/10.1145/97243.97281

  8. Nielsen, Jakob. 1995. “Usability Problems Found by Heuristic Evaluation.” Nielsen Norman Group. January 1, 1995. https://www.nngroup.com/articles/usability-problems-found-by-heuristic-evaluation/