Case Study - FMEA and Risk Coverage

Roman Eysn

2025-08-18

Recently I have been working with a test engineer to figure out where a software system needs better test coverage. Both the engineer and I were relatively new to the team, which previously had no dedicated tester.

While the development team had done a great job of writing automated tests, there was not yet a clear strategy for which tests to write for which features. Because the application had to integrate with several other systems, it was clear that we needed a test plan spanning those systems.

We set ourselves the goal of mapping out what additional tests were needed.

The Process

There were four phases to this process:

Write a list of existing tests, automated and manual
Analyse quality risks
Create a Failure Mode and Effects Analysis (FMEA) matrix
Map the list of existing tests to the FMEA matrix, evaluating coverage

Phase 1: Existing Tests

This is the most straightforward part of the overall plan. We simply took a look at all tests (other than unit tests) that had been written, and all manual tests, which were already maintained in a test management tool.

Next we went through the list and described what each test was doing and what it was asserting. This would later make it easier to judge what risks are covered by each test.

Phase 2: Quality Risks

I typically perform a quality risk analysis when setting out to write a test strategy at the beginning of a project. No quality risk analysis document had been created for this project yet, so this was our next task.

Why is this needed?

For our purpose, we needed an overview of what parts of the system were already tested, and where there were gaps. But there is no way to do that without having some sort of map of the overall system.

One way to do that is to look at the list of features in an application. That is, however, misleading, because features only describe the functionality of a system, not the non-functional aspects. For example, looking at the list of completed features, which one describes the performance requirements? Usability? Security? Reliability?

A quality risk analysis document instead gives a more complete overview of a system's quality requirements, including non-functional requirements.

When I create a list of quality risk criteria, I usually look at two sources:

The ISO 25010 standard, which describes quality characteristics and sub-characteristics
A list of all software components, typically derived from the source code (there are tools that can help with this, but if the code is well-structured, sometimes it is enough to just look at the namespaces or package/folder structure)
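As a rough illustration of the second source, the following sketch derives a component list from the folder structure alone. It assumes that directories map reasonably well to components; the function name `list_components` and the `max_depth` parameter are hypothetical and not part of any tool we used.

```python
from pathlib import Path


def list_components(source_root: str, max_depth: int = 2) -> list[str]:
    """Return directories up to max_depth below source_root as candidate components.

    Assumption: the folder structure reflects the component structure.
    Hidden directories and Python cache folders are skipped.
    """
    root = Path(source_root)
    components = []
    for path in root.rglob("*"):
        if not path.is_dir():
            continue
        relative = path.relative_to(root)
        # Skip anything inside hidden or cache directories (e.g. .git).
        if any(part.startswith(".") or part == "__pycache__" for part in relative.parts):
            continue
        if len(relative.parts) <= max_depth:
            components.append(relative.as_posix())
    return sorted(components)


if __name__ == "__main__":
    # Lists candidate components below the current working directory.
    for component in list_components("."):
        print(component)
```

The resulting list is only a starting point. It still needs to be reviewed by hand, since folders such as build output or shared utilities may not be meaningful components.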

Next, we went through the list of quality characteristics and sub-characteristics and discussed which ones were relevant for our system (you will want to do this with stakeholders). We then created a list of quality risk categories for relevant quality characteristics.

Here is an example of what this looks like:

Quality Risk Category	Description	Affected Quality Criteria
Core Functionality	The system does not provide the required functionality	Functional Suitability
Load and Capacity	The system does not perform well under load	Performance Efficiency
Performance	Operations take too long to complete	Performance Efficiency
Integration	Failure to integrate properly with other systems	Functional Suitability, Interoperability
Authorization	Users can trigger actions they lack permission for, or cannot trigger actions they are permitted to perform	Security
Reliability and Availability	The system is not available as dictated by the SLA	Reliability
Recoverability	Data / system state cannot be recovered after a failure	Reliability
Data Quality	Data is not valid, complete, or consistent	Functional Suitability, Reliability
...	...	...

A few notes:

We deliberately left out some quality characteristics that were relevant for the application but not for our testing purposes
There are different ways to structure the quality risk categories, but we found that this format worked well enough for us

Phase 3: FMEA Matrix

This was the most time-consuming part of the process. We created an Excel table that listed all quality risk categories and their relevant risks / effects. The following structure is taken from the book "Critical Testing Processes" by Rex Black. I have mixed feelings about the book, but I find its introduction to FMEA very good, and the author gives useful examples.

Here is what the FMEA matrix can look like (I cannot share the actual document, so below is a generic example):

ID	Quality Risk Category	Risk / Effect Description	Severity (1-5)	Priority (1-5)	Probability (1-5)	Risk Value	Recommendation	Test Type
1000	Core Functionality	The system does not provide the required functionality
1001		Distance Calculation	2	1	3	6	Cover by multiple test types: E2E, Integration, manual test cases	Auto. and manual
1002		Cost Calculation	2	2	2	8	Cover by multiple test types: E2E, Integration, manual test cases	Auto. and manual
2000	Load and Capacity	The system does not perform well under load
2001		System fails at peak times during workday (50+ conc. users)	1	1	4	4	Load test for resource intense core functionality with 100+ concurrent users	Load
...	...	...	...	...	...	...	...	...
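In the example rows, the risk value equals the product of severity, priority and probability (2 x 1 x 3 = 6, 2 x 2 x 2 = 8, 1 x 1 x 4 = 4). A minimal sketch of that calculation, assuming this product rule holds for every row, could look like the following. The class name `FmeaRow` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FmeaRow:
    risk_id: int
    description: str
    severity: int
    priority: int
    probability: int

    def __post_init__(self) -> None:
        # All three ratings use the 1-5 scale from the matrix header.
        for name in ("severity", "priority", "probability"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def risk_value(self) -> int:
        # Assumption inferred from the example rows: plain product of the ratings.
        return self.severity * self.priority * self.probability


rows = [
    FmeaRow(1001, "Distance Calculation", 2, 1, 3),
    FmeaRow(1002, "Cost Calculation", 2, 2, 2),
    FmeaRow(2001, "System fails at peak times during workday (50+ conc. users)", 1, 1, 4),
]

for row in rows:
    print(row.risk_id, row.risk_value)
```

In a spreadsheet the same thing is a single formula column, which is what makes the matrix easy to maintain by hand.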

Phase 4: Mapping Tests to FMEA Matrix

Taking the list of all existing tests and the FMEA matrix, we added another column to the matrix for "Coverage". This column contains an estimated percentage value reflecting how much of the risk is covered by the existing tests. At 0%, no existing tests cover the risk; at 100%, the risk is judged to be adequately covered by existing tests.

Lastly, we took all rows that had a low "Risk Value" (e.g. <= 12; in this matrix, a lower value means a more critical risk) and a low "Coverage" (e.g. <= 50%) and discussed what additional tests would be needed to cover the risk. These were the risks we would then prioritize when writing new test cases (automated if feasible, manual if not).
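The selection step can be sketched as a simple filter and sort. This is only an illustration using the example thresholds mentioned above; the coverage percentages, the row with ID 9001, and the names `CoverageRow` and `select_gaps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageRow:
    risk_id: int
    description: str
    risk_value: int
    coverage_percent: int  # estimated, 0-100


def select_gaps(rows, max_risk_value=12, max_coverage=50):
    """Return rows with a low risk value and low coverage, most urgent first."""
    candidates = [
        row
        for row in rows
        if row.risk_value <= max_risk_value and row.coverage_percent <= max_coverage
    ]
    # Lowest risk value first, then lowest coverage.
    return sorted(candidates, key=lambda row: (row.risk_value, row.coverage_percent))


# Coverage values below are made up for illustration.
matrix = [
    CoverageRow(1001, "Distance Calculation", 6, 80),
    CoverageRow(1002, "Cost Calculation", 8, 40),
    CoverageRow(2001, "System fails at peak times during workday (50+ conc. users)", 4, 0),
    CoverageRow(9001, "Hypothetical low-priority risk", 48, 10),
]

for row in select_gaps(matrix):
    print(row.risk_id, row.risk_value, row.coverage_percent)
```

With these made-up numbers, the peak-load risk and the cost calculation would be discussed first, while the well-covered distance calculation and the low-priority row drop out.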

Conclusion

I was very happy with the outcome of this process. We ended up with a clear overview of which risks were not adequately covered, and a prioritized plan of what to address next.

I also found this approach to be relatively quick to implement. Test plan design and implementation can take a long time, and I have seen efforts where it took months to get to the point of actually writing the first tests. Here, we got to implementation work in a matter of days. Granted, larger projects involve more people and need several feedback cycles.

But I am a strong believer in fast feedback loops, and this approach allowed us to get test results quickly.

Sources

Black, R. (2004). Critical Testing Processes: Plan, Prepare, Perform, Perfect. United Kingdom: Addison-Wesley. https://a.co/d/gm1wG7F