Automated Tests for Distributed Systems

In this article, I want to share some of my experiences with automated tests for distributed systems. I also want to give you ideas on how to tackle some of the more thorny challenges that come with automating tests for such systems.

Challenge 1: Test Data

One huge headache is setting up test data across multiple systems, where each system has its own data model and data storage. Data is usually coupled in some way, so that the data in one system is dependent on the data in another system.

Setting up the test data before execution can be a pain:

  • You have to know the data model of each system
  • You need a way to add the data to their respective data stores
  • You have to guess what the data should look like so that the systems can work together

The way I have always approached this problem is to create and synchronize the test data through the same processes that would be used in production, i.e. by invoking the systems' APIs to store data instead of writing directly to the databases.

Sometimes you will find that this is not possible, because the systems under test do not have the necessary integration points. In that case, I suggest adding the needed integration points, even if they are "just for testing".

Sometimes, functionality that was added "just for testing" later turns out to be useful for production features as well.

Cleaning the data after a test run is handled using the same approach: remove data from all systems by leveraging the integration points that would be used in production. And again: if those integration points are not there, add them.
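
To make this concrete, here is a minimal sketch of API-based setup and cleanup in Python. The endpoints and payloads are made up for illustration; the point is that both systems are only ever touched through the same interfaces that production would use.

    import requests

    # Hypothetical endpoints of two systems in the test environment.
    ORDERS_API = "https://test-env.example.com/orders-api"
    BILLING_API = "https://test-env.example.com/billing-api"

    def create_test_customer(name):
        # Create the customer through the public API of the first system ...
        resp = requests.post(f"{ORDERS_API}/customers", json={"name": name}, timeout=30)
        resp.raise_for_status()
        customer_id = resp.json()["id"]
        # ... and propagate it to the second system through the same
        # integration point that production uses.
        resp = requests.post(f"{BILLING_API}/customers/sync",
                             json={"customerId": customer_id}, timeout=30)
        resp.raise_for_status()
        return customer_id

    def delete_test_customer(customer_id):
        # Cleanup goes through the same APIs, in reverse order of creation.
        requests.delete(f"{BILLING_API}/customers/{customer_id}", timeout=30).raise_for_status()
        requests.delete(f"{ORDERS_API}/customers/{customer_id}", timeout=30).raise_for_status()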

Challenge 2: Test Environment

One common practice for setting up test environments is to use "ephemeral" environments. This means that you create a new environment for each test run, which is then destroyed after the test run is complete.

While theoretically a great idea - and in practice often used successfully - it is not always easy to set up such environments. For example, you might find that setting up an entire software landscape with multiple large systems simply takes too long, is too expensive, or is too complicated.

And often the systems under test are not designed to be run in such an ephemeral way. A server application that only runs on Windows Server does not mix easily with Linux-based containers, for example.

What to do when setting up ephemeral environments is not feasible?

One approach is to use a single shared test environment for all tests. This is what engineers who are not familiar with the concept of ephemeral environments often do in practice. And it comes with its own set of challenges.

Test data management is one big challenge with these forms of "static" test environments. If you are not careful, you end up with flaky tests, "dirty" states with test data not being properly cleaned up, and other issues.

Challenge 3: Time

If your tests are interacting with a distributed system with asynchronous communication, you face the following question:

How long do you wait for a response from the system under test?

This is tricky because sometimes tests never complete, for example when messages end up in error queues.

How to set timeouts is a matter of experience and trial and error, and the answer to "how long to wait?" is not clear at all.
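
One pragmatic building block is a small polling helper with an explicit, overridable timeout, so that every wait in the test suite goes through one place and can be tuned centrally. Here is a minimal sketch; the get_order_status helper in the usage comment is hypothetical.

    import time

    def wait_for(condition, timeout_seconds=30.0, poll_interval=0.5):
        # Poll `condition` until it returns a truthy value or the timeout expires.
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            result = condition()
            if result:
                return result
            time.sleep(poll_interval)
        raise TimeoutError(f"Condition not met within {timeout_seconds} seconds")

    # Usage, assuming a hypothetical helper that queries the system under test:
    # wait_for(lambda: get_order_status(order_id) == "SHIPPED", timeout_seconds=60)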

Challenge 4: Systems not under Test

Here is a scenario that I have seen a few times:

An organization has existing systems that are not under test. Sometimes those are externally developed and hosted. Sometimes they are legacy systems that are not being maintained anymore.

Either way, the organization now integrates new systems, say, microservices, with those existing applications.

You cannot really run any tests against the new system landscape without also testing the old stuff. No matter how well you design the new services and how testable you make them, you will have to deal with the old systems.

Some Ideas that might work for you

Here are some ideas I want to share which I found useful in the past.

Test Data Management

Option 1: Isolate test data via separate parent entities

This works for cases where there is a clear hierarchy in the data model. A common example is a top-level "Tenant" entity that contains all data for a customer.

When setting up test data, create a new tenant and use that tenant for all test data. After the test run, you might delete the tenant (and if you are lucky your system cascade-deletes all relevant data), or you might just mark the tenant as "archived" and ignore it in future test runs.
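
Sketched in code, the setup and teardown around such a tenant might look like the following; the tenant API shown here is hypothetical.

    import uuid
    import requests

    BASE = "https://test-env.example.com/api"   # hypothetical tenant API

    def create_test_tenant():
        # Each test run gets its own tenant; all test data lives below it.
        resp = requests.post(f"{BASE}/tenants", json={"name": f"test-run-{uuid.uuid4()}"}, timeout=30)
        resp.raise_for_status()
        return resp.json()["id"]

    def archive_test_tenant(tenant_id):
        # Mark the tenant as archived instead of deleting it; a periodic
        # cleanup job can remove archived tenants later.
        resp = requests.patch(f"{BASE}/tenants/{tenant_id}", json={"status": "archived"}, timeout=30)
        resp.raise_for_status()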

Yes, the engineer in me also screams at the growing waste of disk space. But realistically, you can run a script once a month to delete stale data from the test systems. And often the disk usage is so minimal that it does not matter.

Not clean, but pragmatic.

Option 2: Identify test data via unique identifiers

If option 1 does not work, this is an alternative that might: when generating test data, set an identifier that marks the data as belonging to the current test run.

This can be a UUID that you set into a "note" field, or even append to the name of the entity. After the test run, you can then delete all data that has that identifier.
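
A minimal sketch of this idea, again with hypothetical endpoints; it assumes the API lets you filter entities by that note field.

    import uuid
    import requests

    BASE = "https://test-env.example.com/api"   # hypothetical API
    RUN_ID = str(uuid.uuid4())                  # one identifier per test run

    def create_test_product(name):
        # Tag the entity with the run identifier in a free-text note field.
        resp = requests.post(f"{BASE}/products", json={"name": name, "note": RUN_ID}, timeout=30)
        resp.raise_for_status()
        return resp.json()["id"]

    def cleanup_test_run():
        # After the run, delete everything that carries the run identifier.
        resp = requests.get(f"{BASE}/products", params={"note": RUN_ID}, timeout=30)
        resp.raise_for_status()
        for product in resp.json():
            requests.delete(f"{BASE}/products/{product['id']}", timeout=30).raise_for_status()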

Again, this is not a clean solution, but it works when you have few other options.

Option 3: Remember test data

I remember doing this once for a fairly involved distributed system of which I had been the initial architect (so yes, I tried to test my own mess here). And before I go on: this approach was a real headache and did not work nearly as well as I had hoped.

When initiating the test run, I would keep a HashMap with all entities I created for the test. And after the test run, I would iterate over that map and delete all entities from the database.
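
In essence, the idea boils down to a registry like this sketch, where delete_entity stands in for whatever API call removes an entity.

    class TestDataRegistry:
        def __init__(self):
            # Every entity created during the test run, in creation order.
            self.created = []

        def register(self, entity_type, entity_id):
            self.created.append((entity_type, entity_id))

        def cleanup(self, delete_entity):
            # Delete in reverse creation order to respect dependencies;
            # `delete_entity(entity_type, entity_id)` is a hypothetical helper
            # that removes one entity through the system's API.
            for entity_type, entity_id in reversed(self.created):
                delete_entity(entity_type, entity_id)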

Simple, right?

Not quite. The system under test created additional entities as needed, but it would not always delete this added data when the initial entities were removed. As a result, I had to constantly fine-tune the cleanup code to also delete those leftovers.

I thought this was a clean solution to a complex problem. Yet I ended up implementing so many workarounds that I never used this approach again.

Test Environment Management

Option 1: Just host everything

The best solution, in my opinion, if feasible: create a complete test environment with all the systems you need. In reality, I was able to use this approach once. And yes, it worked well enough.

But I am not sure what happened to this particular project once it scaled and more applications were added.

However, if you can create a complete test environment from scratch, go for it.
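
If the whole landscape can be described declaratively - for example in a single Docker Compose file, which is admittedly rare for large systems - creating and destroying it per test run can be as simple as this sketch.

    import subprocess

    def create_environment(project_name):
        # Start all services defined in the Compose file for this test run.
        # You still have to wait for the services to become healthy before
        # running tests, e.g. with a polling helper like the one shown earlier.
        subprocess.run(["docker", "compose", "-p", project_name, "up", "-d"], check=True)

    def destroy_environment(project_name):
        # Tear everything down, including volumes, after the test run.
        subprocess.run(["docker", "compose", "-p", project_name, "down", "-v"], check=True)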

Option 2: Host what you can and share the rest

I used this approach a few times, and I was never happy with it. The parts of your overall system that you can host in an ephemeral way, you create and destroy as needed.

And the applications that cannot be hosted in such a way, you share between the tests. It is a compromise that works, but you have to pay careful attention to test data and to interference between tests.

Option 3: Share everything

For most companies I worked with, this was what they did. That does not mean I would recommend it.

But the reality is, most organizations do not run everything as microservices, even if that is their architectural goal. There are simply too many standard products that these services have to integrate with: an off-the-shelf ERP system, a cloud-hosted CMS, and this one legacy application that only runs on Windows Server 2000 and refuses to die.

As the one responsible for creating the test strategy and architecture, I might have employed every curse word in the dictionary against those systems. But in the end, I had to accept reality and create a shared test environment.

It is not pretty, but as long as you are careful with test data, it can work. A warning though: be especially careful when you also run manual and performance tests against those shared systems.

Distributed Systems and Timing

I only have one trick here, but it might help you: Do not just wait for a response from the system or for a specific message to arrive in a queue.

Also integrate your test code with:

  • error queues
  • dead letter queues
  • monitoring systems
  • logs and anything else that can tell you what went wrong when a test fails or does not complete.

Not only does this make tests easier to debug, you can also add checks that fail a test early. This can significantly speed up test execution.
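
As a sketch: instead of only polling for the expected reply, also watch the error and dead-letter queues and fail immediately when a message shows up there. The read_queue parameter below is a hypothetical helper that stands in for whatever your message broker's client offers.

    import time

    def await_reply_or_fail(read_queue, correlation_id, timeout_seconds=60.0, poll_interval=1.0):
        # `read_queue(queue_name, correlation_id)` returns the matching message
        # from the given queue, or None if there is none yet.
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            for queue in ("error-queue", "dead-letter-queue"):
                if read_queue(queue, correlation_id):
                    raise AssertionError(f"Message {correlation_id} ended up in {queue}")
            reply = read_queue("reply-queue", correlation_id)
            if reply:
                return reply
            time.sleep(poll_interval)
        raise TimeoutError(f"No reply for {correlation_id} within {timeout_seconds} seconds")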

Systems not under Test

Here are three ideas for how to deal with systems that are not tested, and where you have no meaningful way to get them under test:

Option 1: Create Test Doubles

Test doubles can be Mocks, Fakes, Stubs, or any other kind of stand-in that simulates the behavior of the real system.

Having tried every type of test double (multiple times), I prefer to write Fakes. Fakes have several advantages, which I will discuss in a future blog post.

Whichever you use, the idea is to replace the untestable system with a lightweight version that can be controlled. Then the test can set up and run against the double without having to interact with the real system.
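
As a small example, a Fake for a hypothetical external payment service can be little more than an in-memory class that the tests can inspect and control.

    class FakePaymentService:
        # In-memory fake of a hypothetical external payment service. It implements
        # just enough behavior for the tests and exposes hooks to control it.

        def __init__(self):
            self.payments = {}
            self.fail_next = False   # test hook to simulate a declined payment

        def charge(self, payment_id, amount):
            if self.fail_next:
                self.fail_next = False
                return {"status": "DECLINED"}
            self.payments[payment_id] = amount
            return {"status": "CHARGED"}

        def total_charged(self):
            return sum(self.payments.values())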

The huge downside is that you do not actually test the real application.

Option 2: Use the real system

If you have to test the real thing, and you have no control over it, then you might want to treat it as a black box.

Run your tests and assume the system behaves as expected. If you have a way to detect when it does not, I recommend doing so and failing the test or at least logging the unexpected behavior.

One word of warning regarding load and stress tests: if you hammer a cloud-hosted system with those tests, you might end up with a hefty bill. So, at least for any tests that would cost you a lot, I recommend using a test double instead.

Option 3: Build a layer to interact with the real system

A compromise can be to build a layer that deals with interactions with the real system.

This can be a lightweight wrapper that is quite similar to a test double but routes traffic to the real system. That way, you can add logging, decorate test data, and intelligently handle traffic.
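
Here is a sketch of such a layer, written as a thin gateway in front of a hypothetical REST API of the real system.

    import logging
    import requests

    log = logging.getLogger("external-system")

    class ExternalSystemGateway:
        # Thin layer in front of the real external system: adds logging and tags
        # outgoing test data so it can be identified later, but otherwise
        # forwards every call to the real system.

        def __init__(self, base_url, run_id):
            self.base_url = base_url
            self.run_id = run_id

        def create_order(self, payload):
            payload = {**payload, "reference": f"test-{self.run_id}"}   # decorate test data
            log.info("Forwarding order to real system: %s", payload)
            resp = requests.post(f"{self.base_url}/orders", json=payload, timeout=30)
            log.info("Real system answered with status %s", resp.status_code)
            resp.raise_for_status()
            return resp.json()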

The obvious downside is that you have to build and maintain this layer. Whether the time investment is worth it depends on many factors.

Conclusion

Automating tests for distributed systems is complex and comes with many challenges. I hope I was able to give you some ideas that help you on your automation projects.

By no means are these challenges or solutions an exhaustive list. They are some of my experiences and ideas that I found useful. I also wanted to focus on some of the aspects that I have not seen discussed often, if at all.

If you would like to discuss some of those (or other) topics regarding testing distributed systems, please reach out. I would love to hear from you!

(Contact information can be found on the Home page)