Writing tests is becoming a big part of our job. If it isn’t yet, I strongly encourage you to push your organization down that path. Why could be the topic of another article.
I think there is tremendous value in having an efficient test suite. By efficient, I mean that it doesn’t add much extra work when refactoring and that it gives accurate information when something is broken. And by accurate, I mean as few false positives as possible, as many defects caught as possible, and as few failing tests as possible for a single defect.
As important as tests are to me, I don’t give as much attention to tests as I give to production code… In my reviews, I tend to have lower standards when looking at the tests. For instance, I won’t ask for a refactoring of the tests as long as they seem to be testing the behavior that just changed. It leads to heterogeneous practices. And on some topics, we simply disagree!
This article is about a controversial topic: it tries to show the benefits of using randomness in your tests. I will also cover some of the downsides. If you have more points you would like to add, please ping me on Twitter.
The examples in this article will follow a feature and its testing journey. Here is a description of the feature:
We consider the duration of the rental to be the number of 24-hour chunks between its start time and its end time. When a trip spans across more calendar days than its number of 24-hour chunks, we would like to use the pricing of the car for the most relevant days. For instance: if a trip starts at 2pm and finishes at 8am the next day, we would like to consider the pricing of the car for the first day to be from 2pm to midnight.
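To make the two notions concrete, here is a minimal sketch of the difference between 24-hour chunks and calendar days. The method names `chunk_count` and `calendar_day_count` are hypothetical, and the sketch assumes that a started chunk counts as a full one and that all times are in UTC.

```ruby
require "date"

SECONDS_PER_DAY = 24 * 60 * 60

# Number of 24-hour chunks needed to cover the trip
# (assumption: a started chunk counts as a full one).
def chunk_count(starts_at, ends_at)
  ((ends_at - starts_at) / SECONDS_PER_DAY).ceil
end

# Number of calendar days the trip touches.
def calendar_day_count(starts_at, ends_at)
  (ends_at.to_date - starts_at.to_date).to_i + 1
end

starts_at = Time.utc(2017, 5, 1, 14, 0) # 2pm
ends_at   = Time.utc(2017, 5, 2, 8, 0)  # 8am the next day

puts chunk_count(starts_at, ends_at)        # 18 hours fit in a single chunk...
puts calendar_day_count(starts_at, ends_at) # ...but touch two calendar days
```

With one chunk to bill and two calendar days to choose from, the feature has to decide which day is the relevant one.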
Here we’ll look at the development of the tests written for the #date_range method. This method gives the relevant days we should consider in order to price the trip.
In this context, in order to clarify things between the product owner and the development team, some examples were created and agreed upon before the code was created. Those examples were translated into the following test-cases by the developer:
I rewrote the test names as the ones we had were Example 2, and so on. They were extracted from a spreadsheet of use-cases the product team gave us.
What you may see here is that those examples describe use-cases that we believed would be enough to ensure that the implementation was correct, i.e. to cover all cases. They did cover the given specifications correctly, and the implementation made all tests green. Unfortunately, the whole team forgot about this one:
Because of daylight saving time in some time zones, a trip could span more than N + 1 calendar days, where N is the number of 24-hour chunks between its start time and ends_at. The first lesson here is to be really careful about edge cases.
While in this example it does look like an edge case, it was actually a bit more common. We have an extra rule that allows a trip starting at 10am and finishing at 11am the next day to be considered a one-day trip rather than a two-day trip.
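The sketch below shows why that extra rule makes the problem more common, without involving any time zone. It assumes a tolerance of exactly one hour (inferred from the 10am to 11am example, not confirmed by the real code), and `billed_days` and `calendar_days` are hypothetical names.

```ruby
require "date"

SECONDS_PER_DAY = 24 * 60 * 60
TOLERANCE = 60 * 60 # assumption: one hour of grace, as in the 10am/11am example

# Number of billed days once the tolerance is applied (never below one).
def billed_days(starts_at, ends_at)
  duration = ends_at - starts_at
  [((duration - TOLERANCE) / SECONDS_PER_DAY).ceil, 1].max
end

# Every calendar day the trip touches.
def calendar_days(starts_at, ends_at)
  (starts_at.to_date..ends_at.to_date).to_a
end

# The same 25 hours as the 10am/11am trip, but placed around midnight.
starts_at = Time.utc(2017, 5, 1, 23, 30)
ends_at   = Time.utc(2017, 5, 3, 0, 30)

n = billed_days(starts_at, ends_at)
days = calendar_days(starts_at, ends_at)
puts "#{n} billed day(s) for #{days.size} calendar days"
```

Here N is 1 but the trip touches three calendar days, which is N + 2: the same situation as the daylight saving case.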
The point of the article is to show that without being more clever, we could leverage another strategy to explore the expected behavior and detect that missing use-case from earlier.
generate_context are features that don’t exist yet. If you’re interested in working on implementing them, let me know!
Using that kind of approach leads to fewer examples, and to ones that are more meaningful. Now, the team needs to find properties that the subject under test should respect given a certain context.
The product team and the developer must, together, come up with both those contexts and properties. They force us to clarify our thinking. Here it means that we reformulate relevant days from the original specification. The context and properties force us to extract domain-related concepts such as time_spent_on, which could help to model the problem and maybe lead to a clearer solution.
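As an illustration of what a context and its properties could look like, here is a plain-Ruby sketch that runs without RSpec. Everything in it is an assumption made for the example: the bodies of `time_spent_on`, `date_range` and `generate_context` are hypothetical stand-ins, times are in UTC, and the tolerance rule is ignored.

```ruby
require "date"

DAY = 24 * 60 * 60

# Hypothetical helper: seconds of the trip falling on a given (UTC) calendar day.
def time_spent_on(date, starts_at, ends_at)
  day_start = Time.utc(date.year, date.month, date.day)
  overlap = [ends_at, day_start + DAY].min - [starts_at, day_start].max
  [overlap, 0].max
end

# Hypothetical implementation under test: keep the N busiest days.
def date_range(starts_at, ends_at)
  n = ((ends_at - starts_at) / DAY).ceil
  days = (starts_at.to_date..ends_at.to_date).to_a
  days.max_by(n) { |day| time_spent_on(day, starts_at, ends_at) }.sort
end

# Hypothetical stand-in for generate_context: a random trip of up to 10 days.
def generate_context(rng = Random.new)
  starts_at = Time.utc(2017, 1, 1) + rng.rand(365 * DAY)
  { starts_at: starts_at, ends_at: starts_at + rng.rand(1..(10 * DAY)) }
end

# Returns the violated properties for one context (empty when all hold).
def violations(starts_at:, ends_at:)
  selected = date_range(starts_at, ends_at)
  rejected = (starts_at.to_date..ends_at.to_date).to_a - selected
  spent = ->(day) { time_spent_on(day, starts_at, ends_at) }
  errors = []
  # Property 1: as many days as there are 24-hour chunks.
  errors << "wrong number of days" unless selected.size == ((ends_at - starts_at) / DAY).ceil
  # Property 2: no rejected day is busier than a selected one.
  busier = rejected.any? { |r| selected.any? { |s| spent.call(r) > spent.call(s) } }
  errors << "a rejected day is busier than a selected one" if busier
  errors
end

context = generate_context
puts violations(**context).inspect
```

Note that the first property computes the expected count with the same arithmetic as the code under test, which is one of the risks of this approach.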
Random generators can be shared across the application. Custom generators for any value of your domain must be available, very much like factories would be.
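One possible shape for such shared generators is a small registry, in the spirit of a factory library. This is only a sketch: the `Generators` module, its `define` and `generate` methods, and the `:starts_at` and `:trip` generators are all hypothetical names invented for the example.

```ruby
# Hypothetical registry of named random generators.
module Generators
  REGISTRY = {}

  # Registers a generator; the block receives a Random instance and options.
  def self.define(name, &block)
    REGISTRY[name] = block
  end

  # Builds a value; passing an explicit rng makes the result reproducible.
  def self.generate(name, rng: Random.new, **options)
    REGISTRY.fetch(name).call(rng, **options)
  end
end

DAY = 24 * 60 * 60

Generators.define(:starts_at) do |rng, from: Time.utc(2017, 1, 1)|
  from + rng.rand(365 * DAY)
end

# Generators can be composed, very much like factories.
Generators.define(:trip) do |rng, max_days: 10|
  starts_at = Generators.generate(:starts_at, rng: rng)
  { starts_at: starts_at, ends_at: starts_at + rng.rand(1..(max_days * DAY)) }
end

trip = Generators.generate(:trip, max_days: 3)
puts trip.inspect
```

Threading the same `rng` through nested generators keeps a whole generated context tied to a single seed.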
If it was that great, everyone would be doing it, right?
In this approach, we need to use elements from the context (such as @ends_on) to compute the expected results. What prevents me from making a mistake in both the expected value computation and the production code?
The use-cases approach is simpler to set up and less risky to write because it focuses on a single and fixed context. Even when the context isn’t fixed, we could use constraints on it in order to reduce the complexity of the expected result computation.
In the examples, the arithmetic on start and end dates is the same in terms of complexity.
The obvious difference between the two approaches is that the use-cases are really close to reality while the one using randomness forces us to come up with well-structured rules and a more generalized approach. Driving the implementation from the use-cases may be more natural for TDD practitioners. The use-cases are needed in order to find relevant properties and contexts. Thus, use-cases are still mandatory in the process.
Using randomness is something that many people are afraid of. They may feel that they are losing control, that their test suite is gonna start slowing them down. Here are two remarks that are deep enough to, maybe, make you reconsider:
Those remarks imply that there are various classes of randomness. One comes from impure functions, either in the tests or in the production code. Those could lead to flaky tests.
The other is introduced purposefully: it is here to help us discover failures, reveal inconsistencies in our thinking, and detect unexpected behaviour as soon as possible.
Your tests will run on CI and will give you failures. Once a spec fails, it isn’t obvious what the generated inputs were. Being able to understand and reproduce a failure is critical.
In the example, the context is lost upon failure. It is simple to get that context and it would give us a good hint as to what’s going on. Here is an example:
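A minimal way to keep the context, sketched in plain Ruby: wrap the check so that any failure is re-raised with the generated inputs attached. `with_context` and `PropertyFailed` are hypothetical names, and the check inside the block fails on purpose to show the output.

```ruby
class PropertyFailed < StandardError; end

# Runs the check and, when it fails, re-raises with the generated context
# appended so the failing inputs show up in the test output.
def with_context(context)
  yield context
rescue StandardError => e
  raise PropertyFailed, "#{e.message}\n  context: #{context.inspect}"
end

context = {
  starts_at: Time.utc(2017, 3, 25, 23, 30),
  ends_at: Time.utc(2017, 3, 27, 0, 15)
}

begin
  with_context(context) do |ctx|
    hours = (ctx[:ends_at] - ctx[:starts_at]) / 3600
    # Deliberately failing check, for demonstration only.
    raise "expected at least 25 hours, got #{hours}" if hours < 25
  end
rescue PropertyFailed => e
  puts e.message
end
```

In RSpec, a similar result can be had by passing the context as the custom failure message, the optional second argument of `to`.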
I’m also experimenting with a custom pseudo-random generator that would use a different seed for each test and, in case of a failure, would display that specific seed to you. This experiment is a bit raw at the moment but lives in GitHub’s nicoolas25/fuzzier repository. It would look like this:
When an error occurs, it will output an integer, let’s say 12345, that can be used to reproduce the same randomness:
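The idea can be sketched with Ruby’s standard `Random` class alone. This is not the fuzzier API, only an illustration of the principle; `with_seed` is a hypothetical helper.

```ruby
# Runs the block with a seeded generator and reports the seed on failure.
# When no seed is given, a fresh one is picked for this run.
def with_seed(seed = Random.new_seed)
  yield Random.new(seed)
rescue StandardError
  warn "Failed with seed #{seed}"
  raise
end

# Normal run: a new seed each time.
with_seed { |rng| rng.rand(1..31) }

# Reproduction: reuse the seed printed by a failing run (12345 here).
first  = with_seed(12345) { |rng| Array.new(3) { rng.rand(1..31) } }
second = with_seed(12345) { |rng| Array.new(3) { rng.rand(1..31) } }
puts first == second # the same seed yields the same "random" values
```

This only works if every random value in the test is drawn from the yielded generator; a stray call to `Kernel#rand` would escape the seed.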
The faker gem provides something similar with
This approach is very similar to property-based testing. The difference is mostly that we don’t try many input sets on those examples; only one. But because tests run quite often, we end up with way more use-cases over time. Solutions like Rantly fully embrace property-based testing and provide more tools, including the ability to run a test against many input generations.
Because I see this approach more like an exploration tool, we could try to run a given test many times to be more confident that nothing could go wrong. It would look like this:
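As a sketch under stated assumptions, the exploration can be a simple loop that derives one seed per run and names the seed of the first failure. `explore` and `check_calendar_days` are hypothetical names, and the property being checked ignores time zones and the tolerance rule.

```ruby
require "date"

DAY = 24 * 60 * 60

# Runs the property once per seed and names the seed of the first failing run.
def explore(runs: 100, seed: Random.new_seed)
  runs.times do |i|
    run_seed = seed + i
    begin
      yield Random.new(run_seed)
    rescue StandardError => e
      raise "run #{i} failed with seed #{run_seed}: #{e.message}"
    end
  end
end

# Hypothetical property: in UTC and without tolerance, a trip of N
# 24-hour chunks touches N or N + 1 calendar days.
def check_calendar_days(rng)
  starts_at = Time.utc(2017, 1, 1) + rng.rand(365 * DAY)
  ends_at = starts_at + rng.rand(1..(10 * DAY))
  n = ((ends_at - starts_at) / DAY).ceil
  days = (ends_at.to_date - starts_at.to_date).to_i + 1
  return if days.between?(n, n + 1)

  raise "#{days} calendar days for #{n} chunks (#{starts_at} -> #{ends_at})"
end

explore(runs: 200) { |rng| check_calendar_days(rng) }
puts "200 random trips checked"
```

Rare cases need many runs to show up, so a loop like this fits better as an occasional exploration than as part of every test run.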
Doing that exploration may show you some use-cases you missed and give you more confidence that the properties you specified truly match the requirements.
I think using this kind of approach has multiple benefits:
I wouldn’t recommend this approach for integration testing, where the goal is to secure well-known paths rather than explore all the possible cases. UI tests are another place where I wouldn’t like randomness: you may want to compare screenshots of your application, and that would be harder if the content was changing.
But for a component whose behavior needs to be fully described, I would consider this approach. I would consider it in addition to the usual use-cases for some edge cases. It forces me to think more about the problem and to have deeper discussions with the business. It can also point me to cases I didn’t think of.
As I said before, this technique can be a bit controversial and I invite you to talk about this with your team and share your opinion!