FD.io Takes Over VPP and Unites with DPDK to Accelerate NFV Data Planes to Outright Nutty Speeds
On the realization that I was not about to quit these blog posts any time soon, my inbound marketing manager felt obliged to impart some worldly advice, such as suggesting I load the title up with trending keywords. He had other suggestions as well -- something about quickly getting to the point? -- but I don’t remember exactly what they were right now. Being one to rebel against the conventional wisdom of pundits, I added the nuts of my own accord. My esteemed colleagues call me cynical, which is a badge I would wear with pride… if I had a badge… or any pride. Maybe they have a point.
With that, it’s probably not surprising to you that I took the news of yet another Linux Foundation initiative with a dose of skepticism. Or maybe what I felt was just the fear of another unbudgeted membership fee hitting one of my cost centers. Either way, I headed off to the Fast Data Project website (fd.io) with some trepidation and was not disappointed. Which means I was disappointed, I guess… apart from the doggie logo (named Fido, I assume), which is fantastic. I understand that’s my problem -- the techie introductions were designed for people who are actually “skilled in the art,” to borrow a phrase I would read a few times on this journey. Put another way, they know what they are talking about. Naturally, those “people” didn’t include me, so I thought I should change that. Holding on to the tiny morsel of knowledge I’ve already bestowed on you in one of my previous blog posts, “Accelerating the NFV Data Plane,” I set off.
The primary problem Cisco set out to solve with the development of vector packet processing (VPP) in 2002 was one that I touched on, oh so very lightly, in that exact blog post. It must have done its job well because in 2004 a patent was filed and granted seven years later.(1) Right up until the February 2016 formation of FD.io, the technology was promoted but considered (and clearly detailed as) Cisco proprietary. I would assume that this is now subsumed by the standard Apache License 2.0 patent grant and retaliation clauses,(2) which FD.io operates under -- but don’t take my word for it, I only lawyer on the second Tuesday of every other month.
The fact that VPP is very much related to the concepts I originally explored in that previous post is, in hindsight, probably not surprising given DPDK’s critical role in the NFV data plane and Intel’s obvious interest in FD.io. Given that Cisco first explored introducing VPP to the open source community integrated within a DPDK-accelerated Open vSwitch, and the fact that OVS is what most people think of when the topic of NFV data planes comes up, that’s what I will focus on here. I should make it very clear, however, that the FD.io initiative is not only applicable to OVS but also to other virtualized forwarding engines, such as the standard Linux router distributions or even the Click modular router, hence the interest and project sponsorship from Metaswitch. Indeed, VPP is also applicable to many architectures (x86, ARM, and PowerPC) and deployment environments (bare metal, VM, container).
Supported by the Data Plane Development Kit’s poll mode drivers (PMDs) and ring buffer libraries, VPP aims to increase forwarding plane throughput by reducing the number of misses in flow/forwarding table caches while replacing standard serial packet lookups with a parallel approach. To understand what all that means, we have to make a brief sojourn to Cache-land, which, regardless of how good the brochures make it sound, I would not recommend for a vacation.
It was David Feldmeier who first proposed using caching techniques in routers back in 1988,(3) but you’ll be glad to hear I’m going to fast-forward two and a half decades. If you recall from the aforementioned post, we referenced the addition of a megaflow cache to OVS in the version 1.11 timeframe. Simply put, the incumbent microflow cache suffered from an inordinate number of “misses,” where the data required to process the ingress packet was not found and therefore the entire OpenFlow table (the cache of last resort) had to be consulted. This is not completely surprising, as cache memory is inherently small because of both the relative cost of the component in which it resides or which it uses (i.e. the embedded CPU L3 cache, off-chip SRAM or customized silicon, such as a ternary content addressable memory/TCAM(4)) and the logical maximum size of the cache itself: make it too big, and it might disproportionately increase the overall latency when there is a miss.
The Open vSwitch cache (with or without DPDK acceleration)
Short-lived flows or high-entropy packet fields -- those likely to have differing values from packet to packet -- kill caches, hence the introduction of the megaflow (aggregate) cache into OVS in an attempt to turn mice (flows) into elephants. To explain, I’m going to have to use some cache terminology, which I’d be lying if I said I wasn’t excited about. Let’s start from scratch: An empty cache is “cold,” resulting in misses on all queries, of course. The cache “warms up” as those misses are subsequently used to populate it, at which point the cache is said to be “warm.”(5) A warm cache should be returning an appropriate number of query “hits,” with either simple first-in, first-out (FIFO) methodologies or, more appropriately, least recently used (LRU) or least frequently used (LFU) algorithms deciding how old entries should be replaced by new ones. This replacement policy is actually more critical than I’m giving it credit for here, as a high cache churn rate could be due to poor replacement algorithms. A more likely reason, though, is those pesky short-lived flows, with each new addition resulting in a miss and the possible replacement of a long-term flow entry in the cache with one that is never seen again. An incessant eviction of useful data is (superbly) referred to as “cache thrashing.”
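For those who, like me, need to see it to believe it, here is a toy sketch of LRU replacement and of how a burst of mice flows thrashes a small cache. The cache class, flow names, and capacity are my own inventions for illustration, not anything from OVS or VPP:

```python
from collections import OrderedDict

class LruFlowCache:
    """A toy flow cache with least-recently-used (LRU) replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # flow key -> forwarding action
        self.hits = 0
        self.misses = 0

    def lookup(self, flow_key, slow_path):
        if flow_key in self.entries:
            self.hits += 1
            self.entries.move_to_end(flow_key)  # mark as recently used
            return self.entries[flow_key]
        self.misses += 1                        # cold or evicted: consult the slow path
        action = slow_path(flow_key)
        self.entries[flow_key] = action
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return action

# A burst of short-lived "mice" thrashes a small cache: every lookup is a
# miss, and the churn eventually evicts the long-lived "elephant" flow too.
cache = LruFlowCache(capacity=4)
slow_path = lambda key: f"forward({key})"
for flow in ["elephant"] + [f"mouse-{i}" for i in range(8)] + ["elephant"]:
    cache.lookup(flow, slow_path)
```

By the time the elephant shows up again, the mice have pushed it out, so even the “good” flow misses: 10 lookups, 10 misses, 0 hits.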
In CPU parlance, this functionality would leverage the instruction cache (i-cache), which is how VPP, and consequently FD.io, refer to it in all current documentation. The documentation also mentions a supporting data cache (d-cache), used to store pre-fetched data needed to support the i-cache. VPP primarily favors the i-cache, although there are some advantages obtained with increased d-cache efficiencies as well. As I have a tendency to do in these posts, for simplicity,(6) I’m going to ignore the d-cache.
Those of us who have been working in NFV for the last few years will be very familiar with some of the terminology used by FD.io. While VPP naturally came first, it employs very similar nomenclature to that of NFV Service Function Chaining (or vice versa, I guess), for reasons that, I hope, will become clear as you read on.
With the bottleneck being the cache, even in the most highly tuned, all-user-space, DPDK-accelerated environments, the switch pipeline operates in a serial mode, handling one packet at a time. Even if there is a nice big DPDK FIFO chock full o’ packets, they are sent through the “forwarding graph” individually. In computing parlance, this is called scalar processing. The forwarding graph, which essentially defines the forwarding operations for each given packet, comprises a number of “graph nodes,” each with a different role to play in processing or directing the packet.
Classic scalar processing of an IPv4 packet
After the packet is plucked from the receive queue, the Ethernet headers are decoded and the underlying packet flavor identified by way of the EtherType field. While there are numerous next-hop options, 0x0800 tells us this is an IPv4 packet and it should therefore be forwarded next to the IPv4 validation graph node, which performs the appropriate checksums and verifies the TTL has not expired. Next is the forwarding determination, where the IPv4 forwarding graph node performs lookups in our microflow cache and, if required, the megaflow cache. Two misses and it’s the slow road through the full OVS pipeline before the packet heads to the transmit graph node with its new details. Phew.
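Here is roughly how I picture that scalar slog, as a toy sketch. The function names and the two-tier cache dictionaries are simplified stand-ins of my own devising, not actual OVS or VPP code:

```python
def ethernet_input(pkt):
    # Decode the EtherType; 0x0800 means IPv4.
    if pkt["ethertype"] != 0x0800:
        raise ValueError("not IPv4")
    return pkt

def ip4_validate(pkt):
    # Drop anything whose TTL has already expired.
    if pkt["ttl"] <= 0:
        raise ValueError("TTL expired")
    return pkt

def ip4_forward(pkt, microflow, megaflow, full_table, stats):
    # Try the microflow cache, then the megaflow cache; two misses
    # means the slow road through the full pipeline (full_table here).
    key = pkt["dst"]
    for tier, cache in (("micro_hit", microflow), ("mega_hit", megaflow)):
        if key in cache:
            stats[tier] += 1
            return cache[key]
    stats["slow_path"] += 1
    port = full_table[key]
    megaflow[key] = port   # warm both caches on the way out
    microflow[key] = port
    return port

def scalar_switch(packets, full_table):
    microflow, megaflow = {}, {}
    stats = {"micro_hit": 0, "mega_hit": 0, "slow_path": 0}
    out = []
    for pkt in packets:  # strictly one packet at a time
        pkt = ip4_validate(ethernet_input(pkt))
        out.append((pkt["dst"], ip4_forward(pkt, microflow, megaflow,
                                            full_table, stats)))
    return out, stats

packets = [{"ethertype": 0x0800, "ttl": 64, "dst": "10.0.0.1"}] * 3
out, stats = scalar_switch(packets, {"10.0.0.1": "eth1"})
```

Three identical packets: the first takes the slow path, the next two hit the microflow cache, and every one of them walked the whole graph alone.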
Now thoroughly exhausted by everything we have to go through to process a single packet, I’m beginning to see the value of FD.io. VPP operates on a simple principle with a (typically) scientific name: temporal locality -- or locality in time. In terms of application flows, this phenomenon notes the relationship between packets sampled within a short period of time and the strong likelihood that they are similar, if not identical, in nature. Packets with such attributes will reuse the same resources and access the same (cache) memory locations. To exploit this characteristic, VPP does what the name already implies. Rather than working on single packets in a serial manner, VPP operates simultaneously on an array, or collection, of packets.
It goes without saying that I don’t expect anyone with a detailed knowledge of FD.io to be reading this post, so the following representation of the VPP process has been dramatically simplified to accommodate the rest of us. We are, however, no strangers to DPDK and, as I have already alluded to, FD.io relies heavily on its collection of tools when operating on Intel x86 architectures. Dramatically reducing CPU interrupt overheads, DPDK PMDs periodically interrogate their large FIFO ring for packets awaiting collection. Rather than just grabbing the packet at the front of the line, however, the VPP engine takes a chunk of packets up to a predetermined maximum of, let’s say, 256. Naturally, the vector itself doesn’t contain the actual packets but pointers to their locations in a buffer. It is, however, easier to think of them as the real deal for our purposes.
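A sketch of that chunk-grabbing, with buffer indices standing in for packet pointers. The function name, the deque-as-ring, and the 256 maximum are my assumptions for illustration, not DPDK or VPP API:

```python
from collections import deque

VECTOR_SIZE = 256  # assumed per-dispatch maximum, as in the text

def grab_vector(rx_ring, max_size=VECTOR_SIZE):
    """Pull up to max_size packet 'pointers' (here, buffer indices)
    from the ring in one go, rather than one per pass."""
    vector = []
    while rx_ring and len(vector) < max_size:
        vector.append(rx_ring.popleft())
    return vector

# 600 buffered packets drain in three dispatches rather than 600.
rx_ring = deque(range(600))  # indices into a notional packet buffer
sizes = []
while rx_ring:
    sizes.append(len(grab_vector(rx_ring)))
```

Three trips through the graph (vectors of 256, 256, and 88) instead of six hundred.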
The “superframe” of N packets, as it has been referred to, proceeds to the first graph node, where the Ethernet header is decoded and the EtherType is identified, as previously. While our theory of temporal locality suggests that the EtherType will be identical across the vector (i.e. IPv4), naturally there is a chance a group of diverse packets (e.g. IPv6) made it into the superframe. If this is the case, the forwarding graph forks and the superframe is partitioned into two subframes, each with a distinct next-hop graph node.
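Sketched out, that fork might look like the following. The partitioning function is my own simplification of the idea, not the actual VPP graph mechanics:

```python
def partition_by_ethertype(superframe):
    """Fork the graph: split a vector into per-EtherType subframes,
    each destined for a different next-hop graph node."""
    subframes = {}
    for pkt in superframe:
        subframes.setdefault(pkt["ethertype"], []).append(pkt)
    return subframes

# Mostly IPv4 (0x0800), with a couple of IPv6 (0x86DD) stragglers.
mixed = [{"ethertype": 0x0800}] * 5 + [{"ethertype": 0x86DD}] * 2
subframes = partition_by_ethertype(mixed)
```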
Vector processing of IPv4 packets using DPDK
After IPv4 validation, the IPv4 forwarding rules are applied to the vector. This is where the benefits of vectorization really leverage the temporal relationship of application flows. For a short-lived flow, the first packet in the vector might be used to warm the cache, but the resulting instructions can then be repeatedly executed for each subsequent packet, thereby amortizing the one cache miss across the entire superframe. While our forwarding cache is the best example, there are other i-cache and d-cache lookups across the forwarding graph where the benefits of this amortization are compounded.
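To convince myself the arithmetic works, here is a toy run showing one cache miss amortized over a 256-packet vector. Everything here -- the function, the flow name, the dictionaries -- is illustrative, not VPP code:

```python
def forward_vector(vector, flow_cache, full_table):
    """Forward a whole vector: the first packet of a new flow misses
    and warms the cache; every later packet in the vector hits."""
    misses = 0
    for pkt in vector:
        if pkt["flow"] not in flow_cache:
            misses += 1                               # the one (amortized) miss
            flow_cache[pkt["flow"]] = full_table[pkt["flow"]]
        pkt["port"] = flow_cache[pkt["flow"]]
    return misses

# 256 temporally local packets from the same flow in one superframe.
vector = [{"flow": "f1"} for _ in range(256)]
misses = forward_vector(vector, {}, {"f1": "eth0"})
cost_per_packet = misses / len(vector)  # 1/256th of a miss each
```

Compare that with the scalar case, where the same short-lived flow could cost a miss on its own: here the entire superframe shares it.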
VPP technology, as reported in the FD.io overviews, is highly modular, allowing for new graph nodes to be easily “plugged in” without changes to the underlying code base or kernel. This gives developers the potential to easily build any number of packet processing devices with varying forwarding graphs, including not just those supporting switches and routers but intrusion detection and prevention, firewalls or load balancers. Throw in some native support for an OpenDaylight management agent, which FD.io does with its sample vSwitch/vRouter code, and you have the foundation for an NFV data plane capable of some pretty insane performance. Just how insane is not yet clear, but EANTC tests commissioned by Light Reading on behalf of Cisco last year(7) pitted Cisco’s VPP vSwitch against OVS with DPDK. The throughput deltas, as the address mix increased, were substantial -- and I mean over 15 Gbit/s substantial.
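To illustrate the plug-in idea, here is a toy graph-node registry in which a hypothetical firewall node splices itself into the path without touching the existing nodes. The decorator, the registry, and all the node names are mine, not VPP’s actual plugin API:

```python
# Registry mapping a node name to (handler, next-node name).
GRAPH_NODES = {}

def graph_node(name, next_node=None):
    """Register a handler as a named graph node with a default edge."""
    def register(fn):
        GRAPH_NODES[name] = (fn, next_node)
        return fn
    return register

@graph_node("ethernet-input", next_node="ip4-forward")
def ethernet_input(pkt):
    return pkt

@graph_node("ip4-forward", next_node=None)
def ip4_forward(pkt):
    pkt["forwarded"] = True
    return pkt

def run(pkt, start="ethernet-input"):
    # Walk the graph from the start node until an edge runs out.
    node = start
    while node:
        fn, node = GRAPH_NODES[node]
        pkt = fn(pkt)
    return pkt

# A firewall plugin arrives later and splices itself in front of
# forwarding -- no changes to the existing node implementations.
@graph_node("firewall", next_node="ip4-forward")
def firewall(pkt):
    pkt["inspected"] = True
    return pkt

GRAPH_NODES["ethernet-input"] = (ethernet_input, "firewall")  # re-wire the edge
result = run({"dst": "10.0.0.1"})
```

The same registry could just as easily grow load balancer or intrusion detection nodes, which is the whole point of the modularity claim.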
If this skeptic is sounding like a believer, you are hearing right. Metaswitch is a proud founding member of FD.io. I even happily signed off on the Linux Foundation membership fee when it hit my email inbox… but only because I retired it against our trade show budget code. They won’t notice, will they?
(1) United States Patent No. US 7,961,636 B1, Barach et al., June 14, 2011, which I used as a reference throughout this post.
(2) …each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work…
(3) D. C. Feldmeier, “Improving Gateway Performance with a Routing-Table Cache,” Proceedings of IEEE INFOCOM ’88, 1988.
(4) TCAMs are more efficient, being able to perform a lookup in only one clock cycle. SDRAM requires multiple clock cycles, but is a standard CPU component.
(5) The cache is apparently never “hot,” for reasons I’ve been unable to ascertain.
(6) Really, it’s because I couldn’t work out for the life of me what the d-cache was actually doing, but don’t tell anyone that. No, I have no right to publish a post like this, and yes, you probably should have my job.
Simon is the Director of Technical Marketing and a man of few words.