Case Study: FMEA and Risk Coverage
Recently I have been working with a test engineer to figure out where a software system needs better test coverage. Both the engineer and I were relatively new to the team, which previously had no dedicated tester.
While the development team had done a great job of writing automated tests, there was not yet a clear strategy for what tests to write for which features. As the application needed to integrate with several other systems, the need for a test plan that covered several systems was clear.
We set ourselves the goal of mapping out which additional tests were needed.
The Process
There were four phases to this process:
- Write a list of existing tests, automated and manual
- Analyse quality risks
- Create a Failure Mode and Effects Analysis (FMEA) matrix
- Map the list of existing tests to the FMEA matrix, evaluating coverage
Phase 1: Existing Tests
This was the most straightforward part of the overall plan. We collected all automated tests that had been written (other than unit tests) and all manual tests, which were already maintained in a test management tool.
Next we went through the list and described what each test was doing and what it was asserting. This would later make it easier to judge what risks are covered by each test.
Phase 2: Quality Risks
I typically carry out a quality risk analysis when setting out to write a test strategy at the beginning of a project. No quality risk analysis document had been created for this project yet, so this was our next task.
Why is this needed?
For our purpose, we needed an overview of what parts of the system were already tested, and where there were gaps. But there is no way to do that without having some sort of map of the overall system.
One way to do that is to look at the list of features in an application. That is, however, misleading, because features only describe the functionality of a system, not the non-functional aspects. For example, looking at the list of completed features, which one describes the performance requirements? Usability? Security? Reliability?
A quality risk analysis document instead gives a more complete overview of a system's quality requirements, including non-functional requirements.
When I create a list of quality risk criteria, I usually look at two sources:
- The ISO 25010 standard, which describes quality characteristics and sub-characteristics
- A list of all software components, typically derived from the source code (there are tools that can help with this, but if the code is well-structured, sometimes it is enough to just look at the namespaces or package/folder structure)
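As a rough illustration of the second source, the sketch below lists the folders under a source root as candidate components. The function name `list_components` and the `depth` parameter are hypothetical, and the approach assumes that the folder structure mirrors the component structure, which will not hold for every code base.

```python
from pathlib import Path


def list_components(source_root: str, depth: int = 1) -> list[str]:
    """Return folder paths (relative to source_root) up to `depth` levels deep.

    Assumption: each folder roughly corresponds to one software component.
    """
    root = Path(source_root)
    components = []
    for path in root.rglob("*"):
        if not path.is_dir():
            continue
        relative = path.relative_to(root)
        # Skip hidden folders such as .git, and anything below them.
        if any(part.startswith(".") for part in relative.parts):
            continue
        # Keep only folders within the requested depth.
        if len(relative.parts) <= depth:
            components.append(relative.as_posix())
    return sorted(components)
```

Calling `list_components("src", depth=2)` on a well-structured project gives a first draft of the component list, which can then be cleaned up by hand.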
Next, we went through the list of quality characteristics and sub-characteristics and discussed which ones were relevant for our system (you will want to do this with stakeholders). We then created a list of quality risk categories for relevant quality characteristics.
Here is an example of what this looks like:
Quality Risk Category | Description | Affected Quality Criteria |
---|---|---|
Core Functionality | The system does not provide the required functionality | Functional Suitability |
Load and Capacity | The system does not perform well under load | Performance Efficiency |
Performance | Operations take too long to complete | Performance Efficiency |
Integration | Failure to integrate properly with other systems | Functional Suitability, Interoperability |
Authorization | Users can trigger actions they lack permission for, or cannot trigger actions they are permitted to perform | Security |
Reliability and Availability | The system is not available as dictated by the SLA | Reliability |
Recoverability | Data / system state cannot be recovered after a failure | Reliability |
Data Quality | Data is not valid, complete, or consistent | Functional Suitability, Reliability |
... | ... | ... |
A few notes:
- We deliberately left out some quality characteristics that were relevant to the application but not to our testing purposes
- There are different ways to structure the quality risk categories, but we found that this format worked well enough for us
Phase 3: FMEA Matrix
This was the most time-consuming part of the process. We created an Excel table that listed all quality risk categories and their relevant risks / effects. The following structure is taken from the book "Critical Testing Processes" by Rex Black. I have mixed feelings about the book, but I find its introduction to FMEA very good, and the author gives useful examples.
Here is what the FMEA matrix can look like. I cannot share the actual document, so this is a generic example:
ID | Quality Risk Category | Risk / Effect Description | Severity (1-5) | Priority (1-5) | Probability (1-5) | Risk Value | Recommendation | Test Type |
---|---|---|---|---|---|---|---|---|
1000 | Core Functionality | The system does not provide the required functionality | | | | | | |
1001 | | Distance Calculation | 2 | 1 | 3 | 6 | Cover by multiple test types: E2E, Integration, manual test cases | Auto. and manual |
1002 | | Cost Calculation | 2 | 2 | 2 | 8 | Cover by multiple test types: E2E, Integration, manual test cases | Auto. and manual |
2000 | Load and Capacity | The system does not perform well under load | | | | | | |
2001 | | System fails at peak times during workday (50+ conc. users) | 1 | 1 | 4 | 4 | Load test for resource-intensive core functionality with 100+ concurrent users | Load |
... | ... | ... | ... | ... | ... | ... | ... | ... |
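The risk values in the example rows are consistent with multiplying the three ratings (for row 1001: 2 × 1 × 3 = 6). Assuming that formula, the calculation can be sketched as follows. The `RiskItem` class and its field names are hypothetical and only mirror the column headers of the example matrix.

```python
from dataclasses import dataclass


@dataclass
class RiskItem:
    risk_id: int
    description: str
    severity: int     # rated 1-5
    priority: int     # rated 1-5
    probability: int  # rated 1-5

    def __post_init__(self) -> None:
        # Reject ratings outside the 1-5 scale used in the matrix.
        for name in ("severity", "priority", "probability"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def risk_value(self) -> int:
        # Assumed formula: product of the three ratings.
        return self.severity * self.priority * self.probability


# The rated rows from the generic example matrix.
rows = [
    RiskItem(1001, "Distance Calculation", 2, 1, 3),
    RiskItem(1002, "Cost Calculation", 2, 2, 2),
    RiskItem(2001, "System fails at peak times", 1, 1, 4),
]
for row in rows:
    print(row.risk_id, row.risk_value)
# prints:
# 1001 6
# 1002 8
# 2001 4
```

In a spreadsheet the same thing is a single product formula per row; the validation step is the main reason to move it into code.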
Phase 4: Mapping Tests to FMEA Matrix
Taking the list of all existing tests and the FMEA matrix, we added another column to the matrix for "Coverage". This column contains an estimated percentage value reflecting how much of the risk is covered by the existing tests. At 0%, no existing tests cover the risk; at 100%, the risk is judged to be adequately covered by existing tests.
Lastly, we took all rows that had a low "Risk Value" (e.g. <= 12; on this scale a lower value means a higher risk) and a low "Coverage" (e.g. <= 50%) and discussed what additional tests would be needed to cover the risk. These were the risks we would then prioritize when writing new test cases (automated where feasible, manual where not).
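The selection step can be sketched as a simple filter. The thresholds are the example values from the text, while the `MatrixRow` class, the `select_gaps` function, and the coverage percentages in the sample data are made up for illustration. Sorting the result by risk value is an additional assumption about how one might order the discussion, not something the process prescribes.

```python
from dataclasses import dataclass


@dataclass
class MatrixRow:
    risk_id: int
    description: str
    risk_value: int  # as recorded in the FMEA matrix
    coverage: int    # estimated percentage, 0-100


def select_gaps(rows, max_risk_value=12, max_coverage=50):
    """Return rows with a low risk value and low coverage, lowest risk value first."""
    gaps = [
        row for row in rows
        if row.risk_value <= max_risk_value and row.coverage <= max_coverage
    ]
    return sorted(gaps, key=lambda row: (row.risk_value, row.coverage))


# Risk values match the generic example matrix; coverage figures are invented.
rows = [
    MatrixRow(1001, "Distance Calculation", 6, 80),
    MatrixRow(1002, "Cost Calculation", 8, 40),
    MatrixRow(2001, "System fails at peak times", 4, 0),
]
for row in select_gaps(rows):
    print(row.risk_id, row.description)
# prints:
# 2001 System fails at peak times
# 1002 Cost Calculation
```

Row 1001 drops out because its estimated coverage is above the threshold, even though its risk value is low.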
Conclusion
I was very happy with the outcome of this process. We ended up with a clear overview of which risks were not adequately covered, and a prioritized plan of what to address next.
I also found this approach to be relatively quick to implement. Test plan design and implementation can take a long time, and I have seen efforts where it took months to get to the point of actually writing the first tests. Here, we got to implementation work in a matter of days. Granted, larger projects involve more people and need several feedback cycles.
But I am a strong believer in fast feedback loops, and this approach allowed us to get test results quickly.
Sources
- Black, R. (2004). Critical Testing Processes: Plan, Prepare, Perform, Perfect. United Kingdom: Addison-Wesley. https://a.co/d/gm1wG7F