Applying DevOps techniques to deliver networking software faster and more reliably
Many of us have been around long enough to remember when it would take months and a team of a dozen engineers to do regression testing on a software release before it went out. Thankfully, here at Metaswitch, those days are long gone. You might want to make sure you’re sitting down for this part: we can release our software every week, and we’re working towards releasing software every day.
Just let that sink in: Complex, mission-critical networking code released weekly.
It’s not as crazy as it might sound. We’ve just been busy taking the techniques of Continuous Delivery from the DevOps world and applying them to the development of IP networking software, such as segment routing. In this blog, we’ll explain how we’ve achieved this to accelerate our delivery cycle.
Our industry is changing with the disaggregation of hardware and software, and we’re leading this transformational shift by adapting the way we develop and deliver our networking software to our customers. We also think network operators can use similar techniques to manage software and configuration changes in their own environments.
Research published by Forsgren, Humble and Kim in their highly acclaimed book “Accelerate” identified Lead Time and Deployment Frequency as two of the key measures of software delivery performance.
Continuous Delivery relies on two main techniques: automated regression checking at various levels, and continuous integration, where tests are run automatically whenever code is committed. Together these dramatically increase the throughput of software from developer to customer while significantly reducing the risk of regressions in those software deliveries. These techniques are most widely used by companies developing and hosting web services, the likes of Amazon, Facebook or Microsoft. Applying Continuous Delivery techniques to networking software is not without its challenges, but we’ve found it reaps many of the same rewards.
The result has been transformational for engineering productivity and our ability to release software quickly. Rather than one big release per year, we now have software that can be released every week. Code changes that cause regressions are typically caught within 10 minutes of a developer pushing them to their private branch, and the master branch is protected by thousands of tests at multiple levels. Equally importantly (for us at least!), our engineers are much happier because they’re not spending months running repetitive regression testing cycles.
Here are the key ingredients in a nutshell:
Network level test automation environment
We built a bespoke test environment that allows us to run the full protocol stack with a software dataplane on Docker containers. These can be cabled together dynamically in software to build different network topologies for different tests. We can also dynamically cable in real physical devices as part of these tests. The tests themselves configure the various nodes and verify state and traffic, and the same tests can be run with or without real hardware.
Using CI/CD pipelines
We built Continuous Integration pipelines using GitLab CI that intelligently and automatically run the most relevant tests at various levels when code is committed. This gives the development team initial feedback from Unit Tests within 10 minutes, and it protects the release branch with tens of thousands of tests.
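One simple way to picture "running the most relevant tests" is a mapping from changed file paths to test suites. The Python sketch below is an illustration only, not our pipeline: the directory layout, suite names and the `select_suites` function are hypothetical, and a real pipeline would typically express something similar in its CI configuration.

```python
from fnmatch import fnmatch

# Hypothetical rules: path pattern -> suites worth running when it changes.
RULES = [
    ("src/bgp/*", ["unit/bgp", "network/bgp_peering"]),
    ("src/isis/*", ["unit/isis", "network/segment_routing"]),
    ("docs/*", []),  # documentation changes need no extra suites
]
ALWAYS = ["unit/smoke"]                  # cheap checks run on every commit
FALLBACK = ["unit/all", "network/all"]   # unknown paths get everything


def select_suites(changed_files):
    """Return an ordered, de-duplicated list of suites for a set of changes."""
    selected = list(ALWAYS)

    def add(suites):
        for suite in suites:
            if suite not in selected:
                selected.append(suite)

    for path in changed_files:
        matched = False
        for pattern, suites in RULES:
            if fnmatch(path, pattern):
                matched = True
                add(suites)
        if not matched:
            # Be conservative: a path we know nothing about runs the lot.
            add(FALLBACK)
    return selected


print(select_suites(["src/bgp/fsm.c", "docs/readme.md"]))
```

The fallback is the important design choice here: when the mapping has no opinion about a file, the sketch errs on the side of running more tests rather than fewer.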
As with all good engineering practices, this is a journey, and we continue to invest in improving our tools and processes. Weekly and even daily software releases are commonplace among web-scale companies, but Metaswitch is leading the charge in applying DevOps techniques to complex networking software.
For more on our network software protocol stacks and solutions for disaggregated network equipment, please download our Network Operating System (NOS) Cookbook.
David is an Engineering Manager within the Networking Software team at Metaswitch Networks. He has particular responsibilities for Engineering Productivity, Automation, Testing and Delivery. David has worked in various Development and Management roles across Metaswitch for the last 18 years.