Any organization that’s involved in software development is constantly working to improve – from the overall process to the line-by-line code. There’s an overwhelming selection of measures to choose from to guide that work. Google’s DORA (DevOps Research and Assessment) team identified the four metrics it found most predictive of software delivery performance – and added a fifth in 2021:
- deployment frequency
- lead time for changes
- change failure rate
- time to restore
- reliability
By keeping a close watch on your DORA metrics, your organization can improve both the efficiency and the overall quality of its software delivery. However, leaders working with these five metrics might still wonder where to focus to get the best ROI on improvement efforts. We’ve found several practices deliver the most bang for the buck in managing DORA metrics.
Deployment Frequency
Integral focuses on practices that increase your deployment frequency safely. Without the right management and controls, increasing deployment frequency can result in a poor product. Here are some ways of handling deployment frequency that deliver faster time to value while protecting quality.
Continuous integration and continuous deployment are key.
At Integral, we practice continuous integration and continuous deployment (CI/CD). Continuous integration means we merge new code into the main branch after every user story is done and initial testing is complete. We automate the testing and merging so we can be confident new code won’t introduce breaking changes and that everything deploys seamlessly.
Integral’s continuous deployment strategy involves automatically deploying newly completed code to a staging environment in preparation for release. Automating deployment reduces human error and frees the team to focus on other value-added work. If an issue is found in automated deployment, the team addresses it immediately. This way, we are constantly practicing our deployments, and they don’t become “big bang” events with a lot of stress and troubleshooting.
A little extra time on TDD gets faster, higher-quality results.
We practice Test-Driven Development (TDD). We write a test first, based on the story, knowing it will fail. We then write just enough code to make the test pass. Afterward, we refactor the code while making sure the test stays green. We repeat this cycle until the requirements of the story are met. This gives us higher confidence in our changes. We know the code works because it was meticulously and methodically tested, so we don’t hesitate to deploy it, and we spend less time retroactively looking for issues before deployment.
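As a small, hypothetical illustration of the cycle – the $50 free-shipping rule and the `shipping_cost` function below are invented for this example, not taken from a client project:

```python
# Hypothetical story: orders of $50 or more ship free; smaller orders pay a flat rate.

# Step 1 (red): write the tests first, knowing they will fail.
def test_orders_of_fifty_dollars_or_more_ship_free():
    assert shipping_cost(order_total=50.00) == 0.00

def test_orders_under_fifty_dollars_pay_flat_rate():
    assert shipping_cost(order_total=49.99) == 5.99

# Step 2 (green): write just enough code to make the tests pass.
def shipping_cost(order_total: float) -> float:
    return 0.00 if order_total >= 50.00 else 5.99

# Step 3 (refactor): clean up names or structure while the tests stay green,
# then repeat the cycle for the next requirement in the story.
```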
Lean delivery reduces frequent deployment risks.
At Integral, we break up work into the smallest possible changes that provide value to end users. This increases deployment frequency because smaller user stories are completed and ready to deploy in hours, not days or weeks.
According to DORA, high-performing teams release between once a week and once a month – we accomplish this easily with one- or two-week sprints and at least one production release at the end of every sprint. This makes deployments a common, regularly managed habit instead of a high-risk, high-overhead event.
Lead Time for Changes
Lead time for changes measures how long it takes a team to go from committing code for a feature to that feature being released to production. Elite teams do this in a matter of hours; low-performing teams can take six months or more.
CI/CD investments pay off.
CI/CD takes time to implement, but it’s a vital investment. Once your CI/CD pipeline is in place, your deployment process will be stable and resilient. That stability is the foundation for short lead times to production. Strong CI/CD gives you confidence that each new change can be moved quickly (often immediately) to production once it passes all automated pipeline checks.
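To make the idea concrete, here is a minimal sketch of a pipeline gate in Python – the check commands and the `deploy_to_staging.sh` script are placeholders, and a real pipeline would normally live in your CI service’s own configuration:

```python
# deploy_gate.py - a rough sketch of an automated pipeline gate.
# The check commands and deploy script below are placeholders; substitute
# whatever your project and CI service actually use.
import subprocess
import sys

CHECKS = [
    ["pytest", "--quiet"],   # unit and integration tests
    ["ruff", "check", "."],  # linting / static analysis
]

def run_checks() -> bool:
    """Run each automated check in order, stopping at the first failure."""
    for command in CHECKS:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Check failed: {' '.join(command)}")
            return False
    return True

if __name__ == "__main__":
    if not run_checks():
        sys.exit(1)  # a failed check blocks the deploy
    # Every check passed, so promote the change (hypothetical deploy script).
    subprocess.run(["./deploy_to_staging.sh"], check=True)
```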
Lean agile equals more value sooner.
The traditional waterfall planning method, with all designs placed up front, is too rigid. Lead time for changes grows as the team grapples with learnings gathered during development – and with the redesign and replanning required to address them. As mentioned earlier, Integral works in one- or two-week sprints, and we plan user story details only a few weeks in advance. This way, in each sprint we can apply the previous sprints’ learnings to adjust designs and requirements for upcoming stories.
At the same time, we aim to deliver new features at the end of every sprint. We do this by breaking work into the smallest valuable units and working with product leads to prioritize the next most valuable features, then focusing on those until they are completed. This creates a virtuous cycle: valuable new software ships every sprint, generates learnings, and those learnings feed into how the next features are delivered.
Testing reduces risks of shorter lead times.
TDD and automated testing also reduce lead time for changes. When an Integral dev introduces changes to a complex, interconnected system, tests are written to validate there are no regressions. Tests run before and after code is merged in the automated continuous integration pipeline, giving devs high confidence in their changes, so they spend more time developing new features instead of investigating bugs and regressions. As an added bonus, tests are codified expectations of how the software should behave – reading them is built-in documentation that gives developers context on how the code should act. If the defined behavior needs to change, simply update the expected behavior in your test cases and you are well on your way to stable new features.
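A tiny, hypothetical example of what we mean – the return-policy rule below is invented, but the test name and assertions are the documentation:

```python
# Hypothetical rule: returns are accepted within 30 days of purchase.
def test_returns_are_accepted_within_30_days_of_purchase():
    assert is_return_allowed(days_since_purchase=30) is True
    assert is_return_allowed(days_since_purchase=31) is False

def is_return_allowed(days_since_purchase: int) -> bool:
    return days_since_purchase <= 30

# If the policy changes to 60 days, the first edit is to the expected behavior
# above; the implementation then changes until the suite is green again.
```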
Pairing increases team confidence to deploy.
Pair programming is another practice that reduces lead time for changes. By raising code quality and the team’s shared understanding of the code base, pairing makes us more confident that every change is ready for production.
Each individual dev brings their knowledge and skills to the pair. They learn each other’s skills by collaborating closely. When two people are thinking about how they will solve a problem, their blended solutions lead to better code outcomes than when a developer works alone. Pairing also means there is a built-in code review, so fewer changes are requested after the code is submitted for review by the rest of the team.
Daily pairing can also eliminate knowledge silos on the dev team. In the classic trope of a developer suddenly winning the lottery and leaving her job, her pair partners still have all the context to carry on development changes – because she had never written code alone. The team is only slowed down by lost capacity – not lost knowledge about the code base – and can quickly onboard and train up a replacement dev, whose learning is accelerated through pairing!
Change Failure Rate
Change failure rate is the percentage of deployments that cause a failure in production. No one likes to have their software fail their customers, so it makes sense this metric is on the list.
Early failure and learning prevent production problems.
Integral has a philosophy of failing fast. By strategically surfacing failures in low-risk places, we greatly reduce how many failures make it to production. We also intentionally tackle the riskier problems first, so we hit problems and blockers early and address them – keeping production deploys high quality.
For example, say new code is merged into the main branch to be deployed. We fail fast when the automated test suite runs and reveals a failing test – we address and fix it immediately. We might also release several parts of a complex, challenging feature to a lower environment, spot problems in our design, and quickly make adjustments before releasing it to production.
Similarly, we slice and prioritize work to find failure points and gather learnings earlier. For example, in building a checkout experience, rather than seeking to build an entire guest and authenticated experience and take it to production together, we will first build a guest checkout and get it into production – confirming our integration with the payment processor works and the experience handles traffic well. We can then use learnings from that and real user data from guest checkout to inform how we add the ability for users to create an account, save cards, checkout with their wallet, and view transaction history.
Lean agile, TDD, and pairing make failures smaller and cheaper.
Remember our practices of making small, incremental changes and deploying to production often? They lower the change failure rate as well. If we changed only a few files to implement a new feature and integration testing then fails, we have a much narrower list of possible failure points. If you’ve spent weeks or months developing related features before releasing them, you likely have hundreds of changes or more, and it will take you much longer to identify the break point.
At Integral, we find writing code in a TDD style also greatly improves the change failure rate. As we code, we fail fast at a unit test level, fix the failure, and continue developing. The unit test suite and integration test suite provide feedback quickly, so issues are highly likely to be found in lower environments long before production deploys.
Pair programming also helps developers get feedback in real-time before bugs can arise. We humans can have tunnel vision and unchecked assumptions while working alone. With a pair by your side, you can quickly help each other see your errors. Pairing, therefore, catches failures during coding – reducing the change failure rate.
Time to Restore
Time to restore measures how long it takes an organization to recover from a failure in production.
A whole team qualified to solve the problem.
Try as we might, there is always a risk of bugs being discovered in production. When this happens, thanks to TDD and pairing, we don’t have to depend on one expert developer to solve the issue: many, if not all, devs on the team paired on building the feature (and/or reviewed the code in a pull request). Everyone can also use the test suite and change log to see which changes might have caused the problem. The team has a sense of collective ownership, so no blame is assigned to any individual – and there are many hands fully equipped to quickly debug and resolve the production problem.
No haystack; just pick up the needle.
For many teams, production issues are found only after many, many changes have been made, so they’re left looking for a needle in a haystack, trying to find and fix the root cause. For Integral, the small incremental changes that lead to frequent deployments also have a positive impact on time to restore: since we deploy to production every sprint, we have fewer changes to sift through to resolve a production bug.
Reliability
Reliability was added to the DORA metrics in 2021. It broadly encompasses measurements of availability, latency, performance, and scalability. Unlike the other four metrics, reliability does not have concrete measurements defined by the DORA authors. Instead, teams are asked to define their own reliability benchmarks because each product is unique.
While the other four metrics are about performance while building the software, reliability is about performance while operating the software. Reliability was added to DORA to foster a collaborative relationship between the teams responsible for delivery and operations.
Making reliability meaningful and actionable in your context.
We work directly with client leads to define which reliability measures best fit the system we’re building and keep them aligned with their vision, goals, and business needs for the software. Those measures then guide the lean reliability tooling we set up, discussed more below.
We also collaborate directly with client DevOps and SRE teams to ensure the product complies with all organizational standards. We share our early reliability findings with these teams and work with them to define plans to improve as the product grows. This creates high accountability and strong support for the product’s reliability.
Lean reliability to support lean delivery.
As we build software, Integral includes MVP reliability tooling alongside MVP software. We don’t want to slow down the other DORA delivery metrics by building complex observability and operational tooling – particularly when the software doesn’t yet have the traffic to justify it. Instead, we focus on smaller investments in reliability.
Initial load and performance tests give high confidence the software can handle initial traffic and scale up as needed. These tests can also be automated to run periodically – allowing comparison and trend-tracking on system reliability as more features are added.
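One lightweight way to start is a scripted load test. The sketch below uses Locust (one of several load-testing tools); the endpoints, task weights, and user counts are placeholders to adapt to your own system:

```python
# locustfile.py - a minimal load-test sketch using Locust.
# Endpoints, task weights, and user counts below are placeholders.
from locust import HttpUser, task, between

class StorefrontUser(HttpUser):
    # Simulated users pause 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def view_cart(self):
        self.client.get("/cart")

# Run headless on a schedule to track latency trends over time, e.g.:
#   locust -f locustfile.py --headless -u 50 -r 5 --run-time 5m --host https://staging.example.com
```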
These lean reliability measures also surface areas where the software could operate more efficiently, as well as likely failure points during scaling. We work with clients to draft features that address these risks and can be prioritized once traffic justifies them.
We also advise setting up initial integrations with observability tooling like Splunk or Datadog to track the system’s runtime performance and health – and connecting these systems to automated messaging (e.g., to a team email address or Slack channel). These messages allow the team to respond immediately to reliability issues.
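As a minimal sketch of the messaging piece – assuming a Slack incoming-webhook URL; in practice, Datadog and Splunk monitors can post to Slack or email through their built-in integrations:

```python
# notify.py - a minimal sketch of pushing an alert to a Slack channel through an
# incoming webhook. The webhook URL and message are placeholders; managed
# monitors in Datadog or Splunk can do this through built-in integrations.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def send_alert(message: str) -> None:
    """Post a short alert so the team can respond to reliability issues immediately."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    response.raise_for_status()

if __name__ == "__main__":
    send_alert("p95 checkout latency exceeded 800 ms over the last 5 minutes")
```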
Want the fast lane to better DORA metrics for your organization?
If you are looking to make a positive impact on your DORA metrics, Integral is here for you. Our team of software consultants has the expertise you need to establish and strengthen your product development with the practices discussed above. Get in touch with us at hello@integral.io.
It’s time to build your great idea.