Imagine you’re building a multi-story complex with lots of tenants on each floor, and you need to install an elevator. Now, when it comes to choosing that elevator, you find one with a safety plaque that says, “This elevator should carry more than 1,000 kg. We haven’t tested it, but trust us, the cables are strong.” You’d probably think twice before stepping into that elevator, right? Especially if it doesn’t mention following any safety regulations.
Now, if this were a microwave oven and the manual said it “should last for more than 1,000,000 cycles, but we haven’t actually tested it,” you might not disqualify it for that reason alone because, hey, microwaves are relatively easy to replace, and if one fails, it’s not a big deal.
If something is critically important and expensive to replace, I want to be confident that it’s going to work well under the circumstances in which it will operate. However, I have seen plenty of software that’s critical to somebody yet fails to specify, let alone prove, how it will perform in the circumstances and environment in which it will be executed.
My assertion is this:
Any software product should have documentation, observability, and performance testing whose rigor and detail are proportional to the importance of the software, and particularly to the consequences of its failure.
For example, if a piece of software is mostly a convenience, much like an elevator, you might find rigorous testing and documentation to be low in value. But here’s a thought: an elevator outage can trap someone unable to use stairs. And the risks of an elevator malfunctioning mid-operation? They’re grave. Drawing a parallel to software, even what seems ‘convenient’ can hold significant weight in terms of impact. And while there’s a perception that such testing and documentation are bound to be complicated, comprehensive, and therefore expensive, that’s not always the case.
Balancing Time Constraints and Quality: The MVP Trade-off Dilemma
Going back to the elevator analogy, I would want my elevator to be developed and installed quickly to keep my tenants happy. My elevator company needs to develop and market their latest products fast so they can stay ahead of the competition. Trade-offs are necessary to meet the time pressure of delivering the “MVP” – the Minimum Viable Product. Essential features, like getting from one floor to another, matter more than extras like elevator music. Defining the MVP relies on the malleability of software: you aim to build that elevator as swiftly as possible while maintaining the ability to replace virtually anything in the system at a later time.
When under time pressure to deliver an “MVP”, sacrificing things that do not directly contribute to the behavior of the system is an attractive trap. It’s easy to put aside work that doesn’t directly touch the user.
Benchmarking, load testing, documentation, observability, and monitoring all help predict the potential failures that can occur when a product is in use. But when they are overlooked, it becomes difficult to see which assumptions are being baked into the system.
I firmly assert that benchmarking and load testing must be integral to any team’s definition of “Minimally Viable”. This is not merely a recommendation; it’s foundational for ensuring the product’s robustness. It’s all about making sure that the elevator works like a charm and keeps everyone safe.
“Minimal” gut-check benchmarking
“If debugging is the process of removing software bugs, then programming must be the process of putting them in.”
Edsger Dijkstra
When you’re under time pressure to deliver a “Minimally Viable” software product that can generate market value, your team might be willing to accept a certain degree of dependability trade-off in exchange for releasing the software sooner. However, without performance testing, you lack confidence in the software you’re developing, making it challenging to determine what exactly you’re sacrificing. To establish the threshold for “Minimum Viability,” it’s crucial to demonstrate, within a reasonable range, where the system stands in terms of dependability.
Many applications, especially end-user-facing web applications, can be distilled into a concise series of steps that typically follow this pattern:
- Input Acquisition: Gather input from external sources, like when a user enters data into a form or clicks a specific button.
- Input Parsing: Process and format the input to align with the application’s domain, for instance, normalizing all dates to ISO-8601 format in UTC.
- Resource Retrieval: Retrieve relevant data resources based on the input, which might involve making database queries or calling downstream services.
- Data Processing: Manipulate both the input and resource data together, performing actions such as date comparisons, data sorting, or arithmetic operations.
- Result Storage/Return: Store the results of the processing and/or return them to the user.
These steps embody the core processes of many applications. Proper performance testing of each step ensures that the software maintains its dependability and meets the criteria for “Minimum Viability,” especially when under tight deadlines.
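To make that pattern concrete, here is a minimal TypeScript sketch of a hypothetical `handle_user_input` function that walks through those five steps. The data model and the `findBookings`/`saveReport` helpers are made up for illustration; your own handler will look different, but it will likely have the same shape:

```typescript
// Hypothetical handler following the five steps above.
// The data model and helpers are illustrative, not from a real system.

interface Booking { start: string; end: string; durationHours: number; }

// Stand-ins for a real database query and write (used in steps 3 and 5).
async function findBookings(userId: string, from: string, to: string): Promise<Booking[]> {
  return [{ start: from, end: to, durationHours: 2 }];
}
async function saveReport(report: object): Promise<void> { /* persist somewhere */ }

export async function handle_user_input(raw: { userId: string; from: string; to: string }) {
  // 1. Input acquisition: `raw` arrives from a form submission or an API call.

  // 2. Input parsing: normalize dates to ISO-8601 in UTC.
  const from = new Date(raw.from).toISOString();
  const to = new Date(raw.to).toISOString();

  // 3. Resource retrieval: query the database or a downstream service.
  const bookings = await findBookings(raw.userId, from, to);

  // 4. Data processing: filter, compare, and aggregate.
  const totalHours = bookings
    .filter((b) => b.start >= from && b.end <= to)
    .reduce((sum, b) => sum + b.durationHours, 0);

  // 5. Result storage/return: persist the result and hand it back to the caller.
  const report = { userId: raw.userId, from, to, totalHours };
  await saveReport(report);
  return report;
}
```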
Putting Your Code to the Test: Simulating Real-World Performance
Often, all of those steps can be simplified into a single high-level code operation, like making a `POST` request to a controller or executing a high-level function like `handle_user_input`. Let’s briefly set aside powerful benchmarking and load-generating tools. Instead, imagine we just put our `handle_user_input` function in a loop. What would happen? How closely would that ad-hoc loop resemble the actual conditions when the code runs? If your program is supposed to run in a single-threaded environment like Node.js with a small number of concurrent users, doing a `Promise.all` over a list of 50 `handle_user_input` callbacks would give you a decent approximation of how your system will perform in its intended environment.
Is it an exact match to the final setup? Probably not. Will the database connection be as fast in production? Definitely not. However, by running that loop 10,000 times and recording the time taken each time, you gather substantial evidence on how your system performs in a particular environment. This is invaluable! Now, you can be more confident that there isn’t an obvious bottleneck, reducing the chances of having to overhaul the code soon.
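Here is what that gut-check might look like in TypeScript, reusing the hypothetical `handle_user_input` from the sketch above. The 10,000 calls and the batches of 50 simply mirror the numbers in this example; tune them to your own expectations:

```typescript
// Gut-check benchmark sketch: 10,000 calls in concurrent batches of 50,
// recording how long each call takes. Assumes the hypothetical
// `handle_user_input` module from the earlier sketch.
import { performance } from "node:perf_hooks";
import { handle_user_input } from "./handle_user_input";

const TOTAL_CALLS = 10_000;
const CONCURRENCY = 50;

async function timedCall(): Promise<number> {
  const start = performance.now();
  await handle_user_input({
    userId: "user-123",
    from: "2024-03-01T09:00:00+02:00",
    to: "2024-03-01T17:00:00+02:00",
  });
  return performance.now() - start;
}

async function main() {
  const durations: number[] = [];
  for (let done = 0; done < TOTAL_CALLS; done += CONCURRENCY) {
    // Fire 50 calls at once, roughly simulating 50 concurrent users.
    const batch = await Promise.all(
      Array.from({ length: CONCURRENCY }, () => timedCall())
    );
    durations.push(...batch);
  }

  durations.sort((a, b) => a - b);
  const pct = (p: number) => durations[Math.floor((p / 100) * (durations.length - 1))];
  console.log(`calls: ${durations.length}`);
  console.log(`p50: ${pct(50).toFixed(1)} ms`);
  console.log(`p95: ${pct(95).toFixed(1)} ms`);
  console.log(`p99: ${pct(99).toFixed(1)} ms`);
}

main().catch(console.error);
```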
To draw a parallel, consider if an elevator had a plaque inside that said: “This model elevator underwent 10,000 tests in a controlled environment with a brand new elevator, each time carrying 1,000 kg, and it never failed.” You’d feel much more comfortable stepping into that elevator, wouldn’t you?
Taking it a step further, if your system has a pre-production environment resembling the production environment, you can run the same script against it to validate the system. Document the results in a plain-text file within a `benchmarks` folder in your project; it’s like creating your own elevator test plaque, even if only those with access to the source code can see it.
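Sticking with the sketch from above, writing that plaque can be as simple as appending a short summary to a text file. The folder layout and file name below are just one arbitrary choice:

```typescript
// Sketch: persist the gut-check summary as a plain-text “elevator plaque”
// under a benchmarks/ folder. The file name and format are arbitrary choices.
import { mkdirSync, writeFileSync } from "node:fs";

export function recordBenchmark(lines: string[]) {
  mkdirSync("benchmarks", { recursive: true });
  const date = new Date().toISOString().slice(0, 10); // e.g. "2024-03-01"
  writeFileSync(
    `benchmarks/${date}-handle-user-input-gut-check.txt`,
    lines.join("\n") + "\n",
    { flag: "a" } // append, so repeated runs build up a history
  );
}

// Called from the benchmark script above, for example:
// recordBenchmark([
//   "environment: local laptop, Postgres running in Docker",
//   `total calls: ${durations.length}, concurrency: ${CONCURRENCY}`,
//   `p50: ${pct(50).toFixed(1)} ms, p95: ${pct(95).toFixed(1)} ms, p99: ${pct(99).toFixed(1)} ms`,
// ]);
```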
Now, let’s recap what we’ve achieved in a few hours:
- Identified critical code paths expected to handle the majority of the system’s workload in real-world scenarios, such as key requests to a controller or important `handle_user_input`-like functions.
- Considered the level of concurrency expected in the production environment, including simultaneous requests to that controller.
- Evaluated the production system’s environment and resources compared to lower-level environments. Do multiple instances of the application run? Is there parallelism or multi-threading? Are these configurations consistent across production and pre-production? Are there any unusual networking constraints?
- Designed and implemented a script simulating real-world system usage.
- Stored the script’s results in a location accessible to the development team, providing valuable insights into the system’s performance.
This modest investment gives you a better understanding of your system’s performance. It readies you for making informed trade-offs, grounded in a clearer comprehension of the variables at play.
But that one-off script doesn’t make us feel confident
Great! If you feel that your system is too important or too complex for a quick “lunch-time loop gut-check” to verify its dependability, then you might be onto something. If a gut-check like that does not convince the users, clients, or other engineers of the system’s dependability, then I assert that there is an even greater burden of proof on the system’s engineers to show that it can handle its expected environment and load. They don’t let just anybody build a commercial airliner.
Now, if you reach this point after your initial gut-check test, you can use your findings (i.e. some important code paths, a rough idea of scale, a rough idea of concurrency goals) to help design more comprehensive testing and documentation that makes everyone confident in the system’s reliability.
More complex? More critical? I need more confidence.
Software should have documentation and performance testing whose rigor and detail are proportional to the importance of the software. If you’ve been convinced of that argument by now, and a “gut-check benchmark” isn’t proportional to what you’re working on, then your system deserves the attention of some more powerful tools. For getting started with environment-agnostic performance testing, I would recommend k6. And for writing tests that increase your confidence in how dependably your software handles the inputs it may receive, I recommend learning about property-based testing, for which libraries are available in many popular programming languages (e.g. TypeScript, Elixir, Rust).
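As a starting point, here is roughly what a minimal k6 script looks like. k6 scripts are written in JavaScript, and the endpoint, load profile, and threshold below are placeholders you would replace with your own:

```javascript
// Minimal k6 load-test sketch. The URL, load profile, and threshold
// are placeholders; adjust them to match your own system and goals.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "1m", target: 50 },  // ramp up to 50 virtual users
    { duration: "3m", target: 50 },  // hold steady
    { duration: "1m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500"], // fail the run if p95 exceeds 500 ms
  },
};

export default function () {
  const res = http.post(
    "https://staging.example.com/api/handle-user-input", // placeholder endpoint
    JSON.stringify({ from: "2024-03-01T09:00:00Z", to: "2024-03-01T17:00:00Z" }),
    { headers: { "Content-Type": "application/json" } }
  );
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1); // think time between iterations for each virtual user
}
```

You would run it with `k6 run script.js`, pointed at something like your pre-production environment, and keep the output next to the notes in your `benchmarks` folder.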
Software has become something that billions of us depend on every day, in nearly every aspect of our lives. We should be building software that we’re confident will meet those needs and expectations, and proudly display our performance testing “elevator plaques”.
Performance testing isn’t just tech jargon: it’s the backbone of dependable software. How are you ensuring that the software you deliver performs the way you expect? Dive deeper into advanced testing techniques with us.