This is why a new approach to pure software packet processing will make you rethink your choice of 5G UPF
Naturally, the first thing I worry about, before I start actually writing a blog post, is what cutesy title to give it. Suffering the sort of writer's block that would make Jack Torrance seem sane, I ultimately took to the Internet for inspiration. Fortunately, there are more blog posts about writing blog posts than there are blog posts. Probably. Anyway, it didn’t take me long to find some research that suggested that the two most effective phrases for blog headlines are “this is why” and, by a landslide at number one, “will make you.” Excellent. Title sorted. So, with the important stuff out of the way, I can jump into the topic at hand. After all, all play and no work makes Simon a mere toy.
Now, I have written about data plane acceleration techniques in the past, including this one on SR-IOV and DPDK (2015), which I followed up with a look at benefits afforded by programmable Open vSwitch (OVS) data planes employing P4 for Programming Protocol-Independent Packet Processors (2015). Later, I covered the attributes and advantages of Vector Packet Processing (VPP) and its exodus into the open source arena with the Fast Data Input/Output (FD.io) Project (2016). While these posts are getting quite long in the tooth, they remain applicable (enough) not to motivate me to rewrite or update them here. Plus, these topics are now far more mainstream and familiar than they were back then, so it’s probably not warranted anyway. Moreover, while SR-IOV and DPDK remain the predominant mechanisms for efficiently getting packets off a physical wire and into a virtualized packet processing engine, the others don’t necessarily represent the most effective mechanisms for provisioning programmable pipelines or delivering pure software packet processing in cloud-native core networks.
Pure software-based 5G network switching and routing
While software switching is the obvious (only) answer to intra-virtualized server traffic, our interest in these techniques extends far beyond even data center traffic flows or Network Functions Virtualization Infrastructure (NFVI) supporting Virtualized Network Function (VNF) workloads. Making a rapid beeline toward the cloud, emerging 5G mobile architectures are demanding a high degree of dynamic, isolated end-to-end virtualization (aka network slicing). More than ever, we therefore need to seriously consider the use of virtualized x86 platforms, to support core network switching and routing requirements. Specifically, this means the 5G Core User Plane Function (UPF), which is the heir apparent to the PGW-U and SGW-U, for anyone who dares dip their toe into intermediate Control and User Plane Separation (CUPS) options.
These large platforms are currently the bastion of custom silicon and closed Network Operating Systems (NOSs), so one would reasonably assume that the only alternative is merchant silicon, in the form of a white box switch supporting some Open Networking Linux (ONL) derivative. While this would reduce the all-important cost-performance metric to a small extent, the fact that these remain fixed form-factor, hardware-centric devices means that the flexibility to granularly deploy data plane functionality where and when it’s required and in near real time would remain an unattainable goal. Ultimately, the adoption of such hardware-centric openness could be perceived as a big risk for little gain.
So, we should instead consider a Commercial Off-The-Shelf (COTS) server platform with either hardware or OS virtualization -- Virtual Machine (VM) or Linux Container (LXC), respectively -- as a network switching and routing platform, over a classic proprietary hardware or open white box switch. Since it has been deployed to support hardware virtualization for a decade now, we can look to OVS as our benchmark for software-based network switching performance, but there are other more recent options to consider -- and not just VPP/FD.io. Depending on the type of pipeline stages employed, switching systems like OVS and FD.io are categorized in one of two ways: data-driven or code-driven. Unfortunately, even with the best x86 data plane acceleration techniques the industry can currently offer, the performance levels that can be achieved in software-based implementations of these systems fall far short of what can be attained using dedicated hardware platforms.
Data-driven and code-driven switch implementations
The OpenFlow-controlled Open vSwitch employs a data-driven switching pipeline, which represents the chain of processing elements (stages) where the output of one is the input of the next. The stages are represented in terms of a data table, against which packets are matched. Matching each packet’s header fields against a set of patterns and acting according to the most relevant match, better known as match-action, is the fundamental principle of packet processing in switched networks.
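To make the match-action idea concrete, here is a minimal Python sketch of a single table. It is not how OVS is implemented: the `Rule` and `MatchActionTable` names, the dict-based packet and the action strings are all assumptions made for illustration, and "most relevant match" is approximated with a simple numeric priority.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a packet is a dict of already-parsed header fields.
Packet = dict


@dataclass
class Rule:
    priority: int                    # higher priority wins when several rules match
    match: dict                      # field -> required value; missing fields are wildcards
    action: Callable[[Packet], str]  # what to do with a matching packet


class MatchActionTable:
    def __init__(self, rules, default_action):
        # Sort once so a lookup can stop at the first (most relevant) hit.
        self.rules = sorted(rules, key=lambda r: r.priority, reverse=True)
        self.default_action = default_action

    def apply(self, pkt: Packet) -> str:
        for rule in self.rules:
            if all(pkt.get(k) == v for k, v in rule.match.items()):
                return rule.action(pkt)
        return self.default_action(pkt)


table = MatchActionTable(
    rules=[
        Rule(10, {"ip_dst": "10.0.0.1"}, lambda p: "output:1"),
        Rule(20, {"ip_dst": "10.0.0.1", "udp_dst": 2152}, lambda p: "output:2"),
    ],
    default_action=lambda p: "drop",
)
```

A pipeline is then simply several such tables in sequence, with the action of one deciding which table sees the packet next.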
Data-driven switches have a single parser that propels packets through each stage of the pipeline. Starting at the leading edge, the parser is responsible for sequentially identifying and extracting the appropriate header fields in a packet for processing by subsequent stages. Packet parsing is recognized as a primary bottleneck in switches, owing to the diversity and complexity of the packet headers employed in internetworking. Not only do the length and format of packets vary, it’s not uncommon for each one to have eight or more headers. This is compounded by the increased adoption of tunneling protocols and stacked shim layers, such as VLAN, MPLS and GRE, thereby making parser design a complex proposition.
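A small Python sketch gives a feel for why stacked shim headers complicate parsing. It walks an Ethernet header and any number of stacked VLAN tags before it can even say where the next header begins. The function name and the returned dict layout are assumptions for illustration; a real parser would carry on through MPLS, GRE, IP and beyond.

```python
import struct

ETH_P_8021Q = 0x8100   # IEEE 802.1Q VLAN tag
ETH_P_8021AD = 0x88A8  # IEEE 802.1ad (QinQ) outer tag
ETH_P_IPV4 = 0x0800


def parse_ethernet(frame: bytes) -> dict:
    """Extract Ethernet fields and stacked VLAN IDs, plus the offset of the
    header that follows them."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    dst, src, ethertype = struct.unpack_from("!6s6sH", frame, 0)
    offset = 14
    vlans = []
    # Each VLAN tag pushes the real EtherType another four bytes along.
    while ethertype in (ETH_P_8021Q, ETH_P_8021AD):
        if len(frame) < offset + 4:
            raise ValueError("truncated VLAN tag")
        tci, ethertype = struct.unpack_from("!HH", frame, offset)
        vlans.append(tci & 0x0FFF)  # low 12 bits of the TCI carry the VLAN ID
        offset += 4
    return {"dst": dst, "src": src, "vlans": vlans,
            "ethertype": ethertype, "l3_offset": offset}


# A double-tagged frame: outer VLAN 100, inner VLAN 200, then IPv4.
frame = bytes(12) + struct.pack("!HHHHH", ETH_P_8021AD, 100,
                                ETH_P_8021Q, 200, ETH_P_IPV4) + b"\x45"
parsed = parse_ethernet(frame)
```

Every additional protocol the parser understands adds another loop or branch like this one, whether or not a later stage ever reads the result.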
While, in theory, the parsing engine in data-driven switching implementations should be more efficient, most implementations are impeded by the fact that they pull out every supported protocol -- even if the subsequent stages on the pipeline don’t care about them -- which has a detrimental effect on overall performance. Implementing a parser as a single lump of code is also not very agile in our increasingly microservices-based world, in that the switch can only support new protocols or modify existing ones by updating the entire engine.
Yet, there are important advantages to data-driven implementations, in that the parsing engine has visibility into the entire pipeline. This results in the potential to optimize packet processing, taking the product of many pipeline stages and turning them into just one lookup table, thereby increasing the number of stages without increasing latency. But this can only be achieved in select circumstances and is not a general feature of data-driven switches. While a data-driven pipeline with a large number of stages is fast, alternative code-driven switch implementations, with fewer stages, have them beat. At first glance, data-driven systems should afford greater flexibility, but the very nature of their fixed parser limits that. Essentially, the fact that they can only operate on fields that are recognized within the header makes it more difficult to implement new match-action classifiers.
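One way to picture turning many stages into one lookup table is to remember the end result of the full pipeline, keyed on every field that any stage reads. The Python sketch below does exactly that for a toy two-stage pipeline. The stage functions, tenant names and port numbers are invented for illustration, and this is a simplification of the idea rather than a description of any particular switch.

```python
def stage_vlan(pkt):
    # Stage 1: map the VLAN ID to a (hypothetical) tenant.
    return "tenant-a" if pkt["vlan"] == 100 else "tenant-b"


def stage_route(pkt, tenant):
    # Stage 2: per-tenant route lookup; None means drop.
    routes = {("tenant-a", "10.0.0.1"): 1, ("tenant-b", "10.0.0.1"): 2}
    return routes.get((tenant, pkt["ip_dst"]))


def full_pipeline(pkt):
    return stage_route(pkt, stage_vlan(pkt))


class CollapsedPipeline:
    # Every field that any stage inspects must be part of the key.
    KEY_FIELDS = ("vlan", "ip_dst")

    def __init__(self):
        self.table = {}
        self.slow_lookups = 0

    def process(self, pkt):
        key = tuple(pkt[f] for f in self.KEY_FIELDS)
        if key not in self.table:
            # First packet of a flow walks every stage; the result is cached.
            self.slow_lookups += 1
            self.table[key] = full_pipeline(pkt)
        return self.table[key]
```

Adding a third or fourth stage to `full_pipeline` leaves the cost of a cached lookup unchanged, which is the sense in which stages can be added without adding latency. It also shows why the trick only works in select circumstances: the key has to capture everything the stages depend on.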
High-level depiction of data-driven and code-driven switch pipelines
Code-driven switch implementations have been around since the days of the (Ethernet MAC-learning) Linux Bridge, circa 1991. Introduced in 2000, primarily for academic purposes and therefore with little focus on performance, the Click Modular Router is also a great example of a code-driven switch implementation. VPP/FD.io is a more contemporary example of a code-driven switch architecture, as is the Berkeley Extensible Software Switch (BESS), which was first introduced in 2012.
As the name suggests, in a code-driven pipeline, the switch executes a series of code fragments: loosely coupled pieces of regularly used and reused code that will not run by themselves but are part of a complete program. Because each stage is implemented as a distinct piece of code, it can perform any function one might desire. Or not. This intrinsic flexibility is exactly why developers prefer code-driven switch implementations over implementing their programs as data tables. Not only are data tables boring, it is also easier to create new plug-ins for code-driven stages. A problem arises, however, in that every time a developer wishes to perform a slightly different function, at any stage, a new code module must be written. This is evident in more mature code-driven switch implementations, such as Click, where the number of modules has ballooned from the original 130 to over 500 today. In such open implementations -- where modules submitted to a project have multiple sources with sometimes questionable levels of developer experience -- functionality, reliability and interoperability are also often called into question.
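By way of illustration, a code-driven pipeline can be sketched in a few lines of Python as a chain of small modules. The class names here are made up and bear no relation to the actual Click or VPP module sets. Note that the VLAN-stripping stage has to parse the EtherType for itself, since no upstream parser has done so.

```python
import struct
from typing import Optional


class Module:
    """One stage: arbitrary code that returns the (possibly modified) frame,
    or None to drop it."""

    def process(self, frame: bytes) -> Optional[bytes]:
        raise NotImplementedError


class DropRunts(Module):
    def process(self, frame):
        return frame if len(frame) >= 14 else None


class StripVlan(Module):
    def process(self, frame):
        # This stage does its own parsing of the field it cares about.
        (ethertype,) = struct.unpack_from("!H", frame, 12)
        if ethertype == 0x8100 and len(frame) >= 18:
            return frame[:12] + frame[16:]  # cut out the 4-byte 802.1Q tag
        return frame


class CountFrames(Module):
    def __init__(self):
        self.count = 0

    def process(self, frame):
        self.count += 1
        return frame


def run_pipeline(modules, frame):
    for module in modules:
        frame = module.process(frame)
        if frame is None:
            return None  # a stage dropped the frame
    return frame


counter = CountFrames()
pipeline = [DropRunts(), StripVlan(), counter]
```

Any slightly different behavior, such as stripping two tags rather than one, means writing yet another `Module` subclass, which is how module counts balloon.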
Unlike data-driven switches, code-driven pipelines perform packet parsing at each stage and therefore do not, historically, offer the ability to be optimized. This inherently results in an increase in packet-processing latency as the number of stages increases. Although VPP reduces overheads by batching multiple packets within a superframe prior to match-action classification, loading the code and data required for each stage in the pipeline slows processing down significantly for larger parse graphs. Many VPP benchmarks corroborate this fact in that the numbers they publicize indicate around a 50 percent reduction in throughput between a simple pipeline, like IPv4 routing, and something more complex, such as an IPv6 tunneling overlay.
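The benefit of batching can be sketched with a crude model in Python. The `loads` counter below is only a stand-in for the cost of bringing a stage's code and data into cache, and the vector size of 256 is an illustrative assumption, not a statement about how any given VPP build is tuned.

```python
def per_packet(stages, packets):
    """Run each packet through every stage before touching the next packet."""
    loads, out = 0, []
    for pkt in packets:
        for stage in stages:
            loads += 1  # stage is revisited afresh for every single packet
            pkt = stage(pkt)
        out.append(pkt)
    return out, loads


def per_vector(stages, packets, vector_size=256):
    """Run a whole vector of packets through one stage before moving on."""
    loads, out = 0, []
    for start in range(0, len(packets), vector_size):
        vector = packets[start:start + vector_size]
        for stage in stages:
            loads += 1  # stage is touched once, then reused for the vector
            vector = [stage(pkt) for pkt in vector]
        out.extend(vector)
    return out, loads


# Two toy stages operating on integers standing in for packets.
stages = [lambda p: p + 1, lambda p: p * 2]
packets = list(range(1000))
```

Both functions produce the same output, but the number of stage switches drops from packets × stages to vectors × stages. The per-stage cost has not disappeared, though, which is consistent with throughput still falling as the graph grows.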
FD.io CSIT VPP performance testing: IPv4 versus IPv4/IPv6 overlay tunneling
Counter to data-driven alternatives, code-driven switches with empty pipelines must only handle I/O, copying the packet from the ingress to the egress port, resulting in blazingly fast processing. This might be why we see such a disparity in performance numbers from (shall we say) “less than impartial” benchmark testing results.
Driving me up the wall
So, it’s obvious that there are some rather stark pros and cons to both data-driven and code-driven switch pipeline implementations. Ideally, then, we should be looking at a switching architecture that blends the individual strengths of both these otherwise disparate implementations, while mitigating the apparent weaknesses. We have, of course, seen industry groups make moves toward fixing the inadequacies of their preferred pipelines. Indeed, this is exactly what my aforementioned blog post (Can P4 fix OpenFlow) was about. While OpenFlow is busy populating the lookup tables, P4 could define the packet parser, programming it with only the protocols it needs and telling each stage which header to look at.
As I said back then, the addition of P4 to OVS was seen as being in the sweet spot between the inflexibility of OpenFlow-controlled data-driven systems and the generality of code-driven systems, such as Click. While P4 mitigated some data-driven shortcomings, it was clearly still targeted at data center infrastructure white box implementations and intra-server (east-west / VM-to-VM) OVS applications. And that’s just fine. Let’s not forget that, at the time, OpenFlow and data center infrastructure switching were smack-bang in the middle of their hype-cycle. As the (direct or indirect) architects of OpenFlow, the originators of P4 were looking more closely at preserving their progeny, rather than developing new solutions to other emerging internetworking alternatives. Indeed, with 5G in its infancy, it’s likely no one was thinking about the prospect of migrating classic hardware switch/routing solutions to pure cloud-native platforms at the time.
Supporting dynamic end-to-end network slicing in infrastructures employing hardware (VM) or OS virtualization (Containers) demands a 5G Core network architecture where even the user plane functions are pure software-based, cloud-native solutions. So, we must look to combine the benefits of a low per-stage overhead and a common parser implementation, as we see in data-driven switches, with the low, fixed per-packet overhead and pluggable modularity of code-driven systems. Conversely, we must strive to avoid the pitfalls of data-driven pipelines (that very same gnarly fixed parser) and the tendency for code-driven solutions to dramatically increase packet latency and therefore reduce throughput with complex, multi-stage pipelines supporting packets with numerous headers. This is particularly important because that is exactly what we expect in the Radio Access Network (RAN).
While one might assume that P4 can help code-driven systems in the same way it did for data-driven switches, the programs can only act on a fixed set of tables and each one must be written for every specific architecture. This makes it practically impossible to write a single P4 program that will operate across both container and VM environments with a varying number of interfaces. It is also difficult to develop complex packet pipelines containing hierarchical encapsulations/decapsulations and multiplexing. Again, P4 was really designed for (white box) hardware platforms that employ merchant switching silicon, TCAMs for maintaining match states plus other processors for running the NOS, routing stacks, signaling protocols and management.
Toward greater composability of highly optimized pipelines
As is evident in my previous writings, few at the time lauded VPP and P4 more than I did, but the industry has moved on. COTS server silicon has continued its exponential increases in processing speed, virtualization techniques have become more lightweight and networking software is becoming ever more agile. We should, therefore, stop trying to patch existing switch implementations to leverage these underlying evolutions because they continually bring their excess baggage along for the ride. It is quickly becoming evident that to truly deliver a cloud-native network-access, aggregation or core switch/router capable of delivering the same performance as today’s hardware platforms, we must fundamentally rethink how they are built.
Granular composability of processing pipelines is paramount in today’s dynamic, highly automated network architectures and network functions cannot be hampered by unyielding individual elements. But with increasingly complex user plane transport techniques it is also unacceptable to deploy a system where performance degradation occurs simply when faced with multiple encapsulated packet headers – the norm, in modern mobile communications. Still, there is an advantage in that these high-level encapsulation techniques are reasonably consistent and common across the majority of packets in the edge and core.
A typical code-driven pipeline would require six classifier steps, making it incredibly inefficient -- especially given the predominance of user plane Tunneling Packet Data Units (T-PDUs) with common header fields. The overhead involved in moving a packet through a pipeline like that (loading the classification data and testing for actions) is considerable and results in significant traffic latencies. We can’t reduce the number of distinct headers involved, but we can optimize the packet processing pipeline by collapsing multiple classification steps -- effectively creating a single classifier that performs the role of two or more (traditional) classifier stages. This was the first step Metaswitch took when we set out to tackle the issues of performance and programmability in cloud-native virtualized switch router network function design.
Parser graph of a classic code-driven pipeline versus our optimized, composable pipeline
Naturally, these collapsed stages must be handled efficiently by simultaneously classifying multiple fields from different packet headers, potentially resulting in millions of matches. With the majority of traffic processed using a single match-classify stage, less common packets are placed on a longer pipeline with additional processing stages, which must sometimes inspect the same header fields again. As these are typically less critical packets such as ARP and ICMP, however, the effect of the increased latency is negligible.
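The shape of that fast path and slow path can be sketched as follows in Python. To be clear, this is not CNAP code. The field names, the session table contents and the action strings are all hypothetical, and the packet is assumed to arrive as a dict of parsed fields. The only hard facts used are the well-known protocol numbers: EtherType 0x0800 for IPv4 and 0x0806 for ARP, IP protocol 17 for UDP and 1 for ICMP, and UDP port 2152 for GTP-U.

```python
GTPU_PORT = 2152  # registered UDP port for GTP-U

# Fields from several different headers, matched in ONE lookup.
FAST_FIELDS = ("ethertype", "ip_proto", "udp_dst", "ip_dst", "teid")

# Hypothetical collapsed table: one entry per tunnel.
FAST_TABLE = {
    (0x0800, 17, GTPU_PORT, "192.0.2.1", 0x1001): "decap-and-forward",
}


def fast_classify(pkt):
    """Single classifier doing the work of several traditional stages."""
    return FAST_TABLE.get(tuple(pkt.get(f) for f in FAST_FIELDS))


def slow_pipeline(pkt):
    """Longer path for less common packets; re-inspects some fields."""
    if pkt.get("ethertype") == 0x0806:
        return "arp-handler"
    if pkt.get("ethertype") == 0x0800 and pkt.get("ip_proto") == 1:
        return "icmp-handler"
    return "drop"


def classify(pkt):
    # Most traffic is expected to hit the fast table; the rest takes the
    # longer pipeline.
    return fast_classify(pkt) or slow_pipeline(pkt)


gtp = {"ethertype": 0x0800, "ip_proto": 17, "udp_dst": GTPU_PORT,
       "ip_dst": "192.0.2.1", "teid": 0x1001}
```

The slow path looks at `ethertype` again even though the fast classifier already did, which is the repeated inspection mentioned above.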
While these pipeline optimizations can be statically created during the development of the application-specific processing graph, in instances where the structure cannot be predicted in advance, the composable nature of this solution enables them to be implemented dynamically. This is why we refer to our overall solution as the Composable Network Application Processor, or CNAP for short. Naturally, CNAP avoids the pitfalls of a data-driven switch’s upfront parser by performing that function at each stage. Plus, unlike a classic code-driven system, CNAP significantly reduces the time it takes to load the code onto each stage, implementing a just-in-time parser which collects the fields required for each classifier only as they are required.
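The general idea of a just-in-time parser, extracting a field only when a classifier first asks for it, can be sketched in Python as below. This is a guess at the principle rather than a description of CNAP internals. The `LazyPacket` class is invented for illustration, and the fixed offsets assume an untagged Ethernet frame carrying an IPv4 header.

```python
import struct


class LazyPacket:
    """Parses a header field on first use and remembers the result."""

    # Assumption: untagged Ethernet (14 bytes) followed directly by IPv4.
    _EXTRACTORS = {
        "ethertype": lambda f: struct.unpack_from("!H", f, 12)[0],
        "ip_proto": lambda f: f[23],    # byte 9 of the IPv4 header
        "ip_dst": lambda f: f[30:34],   # bytes 16-19 of the IPv4 header
    }

    def __init__(self, frame: bytes):
        self.frame = frame
        self._cache = {}
        self.parse_count = 0  # how many fields were actually extracted

    def field(self, name):
        if name not in self._cache:
            self.parse_count += 1
            self._cache[name] = self._EXTRACTORS[name](self.frame)
        return self._cache[name]


# Build a minimal Ethernet + IPv4 frame carrying UDP to 192.0.2.1.
ip = bytearray(20)
ip[0] = 0x45                      # version 4, header length 5 words
ip[9] = 17                        # protocol: UDP
ip[16:20] = bytes([192, 0, 2, 1])
pkt = LazyPacket(bytes(12) + struct.pack("!H", 0x0800) + bytes(ip))
```

A classifier that only needs `ethertype` and `ip_proto` never pays for extracting `ip_dst`, and a later stage that asks for `ethertype` again gets the cached value.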
CNAP also makes use of other modern optimization approaches, such as a VPP/FD.io-like packet batching and interleaving technique, where packets are received and grouped into superframes as vectors. This dramatically increases the efficiency of the instruction cache (i-cache) that itself employs the ample and efficient x86 CPU Level 3 (L3) cache memory, which, in general, eliminates the need for a TCAM. Not only does this dramatically reduce cache misses compared with per-packet processing but, as the first packet in the vector warms up the cache, the remaining packets are effectively handled for free. No cache pun intended. Naturally, CNAP also enjoys all the acceleration attributes afforded by the Data Plane Development Kit (DPDK), which it accesses via the native Application Programming Interfaces (APIs).
Given the limitations of P4, CNAP avoids its programming paradigm, instead employing an alternative open API which configures all the runtime components. Application-specific graphs are configured through this API, which can either be collocated with the runtime components or run as a decoupled microservice hosted anywhere within the network. The API is constructed automatically by the CNAP Software Development Kit (SDK) and expressed in a definition document that uses a custom YAML (YAML Ain’t Markup Language) schema. External network automation applications can therefore program the graph without having to operate directly on the packet processing classifying stages of an individual switch pipeline. That said, a P4 cross-compiler enables the P4 programming language to be used when defining CNAP pipelines.
The Metaswitch Composable Network Application Processor has accomplished the enviable feat of combining the advantages of both data-driven and code-driven switch systems, while mitigating the typical drawbacks of these distinct implementations. CNAP delivers a packet processing pipeline which has a low fixed overhead and is highly programmable but isn’t hampered by an unwieldy element, such as a fixed parser. Leveraging many innovative optimization techniques, CNAP is an extremely flexible solution built in a highly modular manner as a pure-software solution for cloud infrastructures but does not suffer from excessive packet latencies when processing multiple packet headers.
The Composable Network Application Processor is more than just a fancy title. Supporting any general-purpose cloud computing environment -- public, private or hybrid VMs, containers or serverless -- it is the foundation on which incredibly powerful, highly flexible and exceptionally cost-effective network-access, aggregation, and core switches and routers can be built. With the 5G core architecture demanding cloud-native user plane functions deployed within highly distributed Multi-Access Edge Compute (MEC) infrastructures to support granular network slicing capabilities, a 5G Core UPF built on CNAP would seem to be the new approach that’s really worth you thinking about.
Simon is the Director of Technical Marketing and a man of few words.