Accelerating the NFV Data Plane: SR-IOV and DPDK… in my own words
It is true -- much has been said about both SR-IOV and DPDK in the past, even right here on our very own blog. I see this as a challenge: An opportunity to tell the story of data plane acceleration in a slightly different way. Our copy editor also sees this as a challenge, of course, as she wades through the mass of potential source material searching for blatant examples of plagiarism in my work. Apparently the “sincerest form of flattery” argument doesn’t hold much weight in the writing community.
That’s a shame, as I have much in common with the originator of that quote, Charles Caleb Colton… and not just because I also fled England to escape my creditors. In his book Lacon, or Many Things in Few Words: Addressed to those who think, Charles says that “error, when she retraces her footsteps, has farther to go, before she can arrive at the truth, than ignorance.” Which is my primary reason for writing a post like this: To help me rein in my ignorance before it marches, in true lemming form, straight off the factual cliff. Plus, a couple of extra page views on our website never hurts. Thank you for those.
As either a champion or outright originator of SR-IOV and DPDK, Intel is an excellent source of information regarding both. SR-IOV, however, has its roots firmly planted in the Peripheral Component Interconnect Special Interest Group, or PCI-SIG for short. Building on and standardizing input-output memory management unit (IOMMU) technologies from AMD (IOV), Intel (VT-d) and others, the “Single Root I/O Virtualization and Sharing Specification” was first released in September 2007 [1], around the same time the concept of server virtualization was hitting its stride. Up to that point, I/O virtualization options were strictly software based, with the hypervisor (virtual machine monitor or VMM) permanently residing on the path between the physical NIC and the virtual NIC (vNIC) associated with the virtual machine (VM) itself.
While this is how we typically think of a virtual machine, and the approach indeed has some advantages, such as legacy application support and a general lack of reliance on “specialized” hardware, it also results in extraordinarily high CPU utilization. This ultimately reduces throughput and the number of VMs that can realistically be hosted on a single platform -- in some cases to the point where virtualization is no longer viable from a business perspective.
As with their non-virtualized counterparts, getting packets off the wire for processing in VM environments is interrupt-driven. When traffic is received by the NIC, an interrupt request (IRQ) is sent to the CPU, which must then stop what it’s doing to retrieve the data. As anyone who has kids will understand, the interruptions are endless -- reducing the CPU’s ability to perform its primary computational tasks. In virtualized hosts, this problem is exacerbated even further. Not only does the CPU core running the hypervisor get interrupted but, once it has identified which VM the packet belongs to (based on MAC address or VLAN ID, for example), it must then send a second interrupt to the CPU core running the VM itself, telling it to come get its data.
That’s a double whammy, but more importantly, that first core is a major bottleneck and can become quickly overloaded. Intel took the first steps towards eliminating this bottleneck and improving overall performance with a technology called Virtual Machine Device Queues (VMDq). As the name suggests, VMDq enables the hypervisor to assign a distinct queue in the NIC for each VM. Rather than sending an interrupt to the hypervisor CPU core first, only the CPU core hosting the destination VM need be interrupted. The packet is only touched once, as it is copied directly into the VM user space memory.
Now that VMDq has removed the bottleneck, interrupts are more fairly load-balanced across all VM host cores. While this has the advantage of keeping the hypervisor in the loop, thereby enabling functions such as a vSwitch to remain inline, the sheer number of interrupts in data-plane-centric virtualized network functions will still have an incredibly detrimental effect on performance. This is where SR-IOV comes in.
When SR-IOV was introduced to the world, we were undergoing somewhat of an interrupt eradication movement. InfiniBand was all the rage, with its remote direct memory access (RDMA) techniques promising to overthrow IRQs… if only it could overcome Ethernet. Like the house, however, Ethernet once again proved it was something that should not be bet against. While not “remote,” in that there is no unique link layer protocol, what SR-IOV delivered was a more agnostic (read: Ethernet), localized and non-overlay approach to DMA, targeted specifically at virtualization. [2]
SR-IOV takes its name from PCI terminology: the element connecting the CPU and memory to the PCI switch fabric is the Root Complex, and the port itself is the Root Port. Like VMDq, SR-IOV allows a single Ethernet interface to appear as multiple separate physical ports. Mitigating the interrupt issue, however, is where SR-IOV excels over VMDq.
VM 3-ways: Normal operation through bottleneck eradication (VMDq) and interrupt mitigation (SR-IOV)
With SR-IOV, the hypervisor is responsible for carving out specific hardware resources in the NIC for each VM instance. Confusing those of us still trying to get our heads around NFV nomenclature, these are referred to as virtual functions (VFs). A virtual function consists of a limited, lightweight PCIe resource [3] and a dedicated transmit and receive packet queue. Each VM loads a VF driver on boot-up and is assigned one of these virtual functions by the hypervisor. The VF in the NIC is then given a descriptor that tells it where the user space memory owned by the specific VM it serves resides. Once received on the physical port and sorted (again by MAC address, VLAN tag or some other identifier), packets are queued in the VF until they can be copied into the VM’s memory location. [4]
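That sorting step can be sketched in a few lines of Python. This is purely a conceptual model -- the class, classifier table and queue names below are invented for illustration, and in a real NIC the sorting and DMA happen in hardware, not software:

```python
from collections import defaultdict, deque

class SriovNicModel:
    """Toy model of an SR-IOV NIC sorting received frames into
    per-virtual-function (VF) receive queues."""

    def __init__(self):
        self.vf_queues = defaultdict(deque)   # VF id -> dedicated RX queue
        self.mac_to_vf = {}                   # classifier table

    def assign_vf(self, vf_id, vm_mac):
        # The hypervisor carves out a VF and ties it to a VM's MAC address.
        self.mac_to_vf[vm_mac] = vf_id

    def receive(self, frame):
        # Sort on destination MAC (could equally be a VLAN tag) and queue
        # in the matching VF until it can be copied into the VM's memory.
        vf_id = self.mac_to_vf.get(frame["dst_mac"])
        if vf_id is not None:
            self.vf_queues[vf_id].append(frame)

nic = SriovNicModel()
nic.assign_vf(vf_id=0, vm_mac="52:54:00:00:00:01")
nic.receive({"dst_mac": "52:54:00:00:00:01", "payload": b"hello"})
print(len(nic.vf_queues[0]))  # one frame queued, awaiting DMA into the VM
```

The point of the model is that no CPU core is consulted during classification; the VM's core is only involved once data lands in its own memory.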
This interrupt-free operation liberates the VM’s CPU core to perform the compute operations demanded of the virtualized network function it hosts. Long story short, SR-IOV is fast. Indeed, the only thing faster is PCI passthrough, but I’m ignoring that because only one VM can use the device at any one time, the PCI device being directly assigned to a single guest. That said, PCI passthrough does have the advantage of not requiring the specialized (if now standardized) hardware features on which SR-IOV depends. But I’m ignoring that as well.
The upside of all this DMA is also its downside, in that anything and everything between the interface and the VM is now bypassed. Even though the Juno release of OpenStack extended Neutron and Nova to support SR-IOV for network devices, thereby reducing the chance of provisioning errors and opening this data plane acceleration option up to the NFV management and orchestration (MANO) layer, the DMA techniques it employs still ultimately result in traffic bypassing the hypervisor vSwitch. This raises questions about SR-IOV’s ability to support the portability, flexibility, QoS, complex traffic steering (including network service headers for service function chaining, if the VNFs are middleboxes) and the expected cloud network virtualization demands of NFV.
That's not to say you can’t run both SR-IOV and the vSwitch in parallel, within a single host or your cloud environment in general, because you absolutely can. This would allow NFVI engineers to use a vSwitch when they need it, SR-IOV when they can, or a combination of both if they dare. SR-IOV is an excellent option for “virtualization,” or the implementation of a stand-alone virtualized appliance or appliances, and it’s highly desirable to have an architecture where high-traffic VNFs, routers or Layer 3-centric devices use SR-IOV while Layer 2-centric middleboxes or VNFs with strict intra-host east-west demands employ a vSwitch. Such hybrid architectures, however, would likely layer additional management complexities and probably rule out the possibility of a common, efficient and flexible SDN host deployment infrastructure. Could I be any vaguer? Possibly, but probably not.
All that conjecture aside, the simple fact is that in NFV infrastructures we will need vSwitches somewhere -- if not everywhere -- but vSwitches are slow, lumbering beasts. That’s not surprising as, in all fairness, the public gets what the public wants… and the public wanted flexibility, programmability and the ability to maintain network state, which doesn’t come for free. The price was paid in performance. Talking Open vSwitch (OVS) from now on, attempts were made to improve performance with the introduction of concepts such as the Megaflow cache (OVS 1.11), which installs secondary flow cache entries in the OVS kernel module.
With OVS split between kernel and user space, a packet that misses in the kernel is passed up to the user space daemon via a Netlink inter-process communication (IPC) call and run through the full pipeline there; an exact-match Microflow cache entry for that flow is then installed in kernel space so subsequent packets are forwarded immediately. The additional Megaflow cache wildcards the fields a flow entry does not need to match on, so a single entry matches many packets, not just one. This prevents a far wider range of traffic from having to make the performance-stifling IPC context switch between kernel and user space and hit the OVS daemon pipeline.
OVS Circa 1.11, featuring the kernel-level Megaflow Cache together with the Microflow Cache.
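A toy sketch may help make the two-tier lookup concrete. The field names and match logic below are simplified illustrations of the idea, not OVS’s actual data structures:

```python
def microflow_key(pkt):
    # Exact match: every header field participates in the key.
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])

def megaflow_match(entry, pkt):
    # Wildcarded match: only the fields the entry cares about are checked,
    # so one entry can cover many distinct microflows.
    return all(pkt[f] == v for f, v in entry["match"].items())

# One megaflow entry covering every packet to 10.0.0.2, whatever the ports.
megaflow_cache = [
    {"match": {"dst_ip": "10.0.0.2"}, "action": "output:vm2"},
]

def lookup(pkt, microflow_cache):
    key = microflow_key(pkt)
    if key in microflow_cache:                       # fastest path: exact match
        return microflow_cache[key]
    for entry in megaflow_cache:                     # still in-kernel, wildcarded
        if megaflow_match(entry, pkt):
            microflow_cache[key] = entry["action"]   # promote to exact match
            return entry["action"]
    return "upcall"   # miss: Netlink IPC up to the user space daemon

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 1234, "dst_port": 80}
cache = {}
print(lookup(pkt, cache))  # megaflow hit -- the costly upcall is avoided
print(lookup(pkt, cache))  # now a microflow hit
```

Only packets that miss both caches pay for the trip to user space, which is why hit rates dictate how well this scheme performs.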
While it is claimed that, by OVS 2.1, this resulted in “ludicrous speed” [5] enhancements, Megaflow’s ability to ‘accelerate’ OVS still depends on the characteristics and behavior of the packets and traffic patterns traversing it, and therefore varies from application to application. Put simply, it either works or it doesn’t -- a level of unpredictability that doesn’t typically fly with those looking to develop and deploy commercial services. Either way, every cache miss still incurs that costly IPC, which is where the Data Plane Development Kit (DPDK) comes into play.
Formally introduced to the world in September 2011, DPDK is a kit -- pure and simple. Leveraging the features of the underlying Intel hardware, it comprises a set of lightweight software data plane libraries and optimized NIC drivers that can be modified for specific applications. 6WIND led the charge with its optimized 6WINDGate offering and support, followed closely by the other wind, Wind River (an Intel division). DPDK was released as open source to the development community at large, under the almighty BSD license, in 2013. While DPDK can be employed in any network function built to run on Intel architectures, OVS is the ideal use case.
The Data Plane Development Kit includes memory, buffer and queue managers, along with a flow classification engine and a set of poll mode drivers. As with the Linux New API (NAPI) drivers, it is DPDK’s poll mode that performs the all-important interrupt mitigation that, as we know from SR-IOV, is critical to increasing overall application performance. Operating in poll mode during periods of high traffic volume, the driver continuously checks the interface for incoming packets rather than waiting for an interrupt. To prevent "poll-ution" (yes, I just made that up to describe an excess of polls when no data is waiting for retrieval) while not leaving packets waiting when they do arrive (thereby increasing latency and jitter), DPDK can switch to interrupt mode when the flow of incoming packets falls below a certain threshold.
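The poll/interrupt switchover can be sketched roughly as follows. The `FakeNic` object and the watermark value are hypothetical stand-ins for driver-level machinery, so treat this as a conceptual model rather than how DPDK itself is written:

```python
class FakeNic:
    """Hypothetical scripted NIC so the receive loop below has
    something to poll; each script entry is one poll's worth of packets."""
    def __init__(self, script):
        self.script = list(script)
        self.interrupts = 0

    def poll(self):
        # Non-blocking check for received packets.
        return self.script.pop(0) if self.script else []

    def wait_for_interrupt(self):
        # Model of arming the IRQ and sleeping until data arrives.
        self.interrupts += 1
        return ["pkt-after-irq"]

def run_rx_loop(nic, low_watermark, iterations):
    empty_polls, received = 0, []
    for _ in range(iterations):
        batch = nic.poll()              # poll mode: no interrupts taken
        if batch:
            received.extend(batch)
            empty_polls = 0
        else:
            empty_polls += 1
            if empty_polls > low_watermark:
                # Too much "poll-ution": fall back to interrupt mode
                # until traffic picks up again.
                received.extend(nic.wait_for_interrupt())
                empty_polls = 0
    return received

nic = FakeNic([["p1", "p2"], [], ["p3"], [], [], []])
pkts = run_rx_loop(nic, low_watermark=2, iterations=10)
print(len(pkts), nic.interrupts)  # 5 packets received, 2 interrupt wake-ups
```

While traffic is flowing the loop burns a core polling, and only when the wire goes quiet does it pay for (far cheaper, because rarer) interrupts.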
With the queue, memory and buffer managers, DPDK can also implement zero-copy DMA into large first in, first out (FIFO) ring buffers located in user space memory, a process akin to PF_RING. That again dramatically improves overall packet acquisition performance, not only enabling faster capture but also smoothing out bursty inbound traffic, allowing the application to handle packets more consistently and therefore more efficiently. Plus, if the guest CPU gets busy with application processing, it can leave packets in the buffer a little longer without fear of them being discarded. Naturally, this buffer -- along with the poll/interrupt thresholds -- needs to be managed closely for latency-sensitive applications such as voice.
The FIFO Ring Buffer
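As a rough illustration, here is a minimal single-producer, single-consumer FIFO ring. A real DPDK or PF_RING implementation works on shared memory with atomic head/tail indices and DMAs packet data straight into the ring; this sketch just shows the index arithmetic:

```python
class RingBuffer:
    def __init__(self, size=8):
        self.slots = [None] * size
        self.size = size
        self.head = 0   # next slot the producer (NIC side) writes
        self.tail = 0   # next slot the consumer (application) reads

    def enqueue(self, pkt):
        if (self.head + 1) % self.size == self.tail:
            return False          # ring full: this packet would be dropped
        self.slots[self.head] = pkt
        self.head = (self.head + 1) % self.size
        return True

    def dequeue(self):
        if self.tail == self.head:
            return None           # ring empty: nothing for the app yet
        pkt, self.slots[self.tail] = self.slots[self.tail], None
        self.tail = (self.tail + 1) % self.size
        return pkt

ring = RingBuffer(size=4)        # 3 usable slots (one is kept empty)
for i in range(5):
    ring.enqueue(f"pkt{i}")      # pkt3 and pkt4 are rejected: ring full
print(ring.dequeue())            # -> pkt0 (first in, first out)
```

A deep ring is what absorbs bursts: the producer keeps filling slots during a spike while the consumer drains them at its own pace, with drops occurring only once the ring genuinely overflows.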
An open vSwitch can perform some magic, but magic always comes with a price, and we end up paying in CPU cores. That CPU tax, however, is dramatically reduced when OVS is accelerated with a modified DPDK implementation. Those accelerating OVS using DPDK have done so with a separated control and data plane architecture that performs packet processing in user space on dedicated CPU cores, offloading that work from Linux. Effectively, DPDK replaces the Linux kernel data plane, meaning that both Microflows and Megaflows are handled in user space while operating in the same manner.
DPDK Accelerated OVS Within User Space
Only complex, state-based control and management protocols (ARP, for example) are sent to the Linux networking stack for processing. When comparing an accelerated vSwitch with a classic vSwitch implementation, the number of CPU cores required by virtualized network functions with high packet throughput demands is dramatically lower. Charlie Ashton, from Wind River, published some specific numbers on his blog about a year ago. [6] With one OVS CPU core processing 0.3 million (64-byte) packets per second (pps), it would take 12 cores to support just 4 Mpps (roughly 1 Gbps of full-duplex Ethernet). And while more cores mean more throughput, each additional core given over to switching reduces the number of VNFs a host can support. Even with unfavorable 64-byte packet sizes, Wind River’s accelerated vSwitch can process 12 Mpps per core, so only four cores are required to max out a single full-duplex 10 Gbps interface.
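For the curious, the wire-rate arithmetic behind such figures is easy to reproduce. The helper below assumes standard Ethernet framing overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), which is what caps a 10 Gbps port at the oft-quoted 14.88 Mpps per direction for 64-byte frames:

```python
def max_pps(link_gbps, frame_bytes=64, full_duplex=True):
    # Each frame occupies frame_bytes + 20 bytes of preamble/SFD/gap on
    # the wire, so a 64-byte frame really consumes 84 bytes (672 bits).
    wire_bits = (frame_bytes + 20) * 8
    pps = link_gbps * 1e9 / wire_bits
    return pps * (2 if full_duplex else 1)

# ~2.98 Mpps of 64-byte traffic on 1 Gbps full duplex: at 0.3 Mpps per
# stock OVS core, that is roughly the dozen cores quoted above, while at
# 12 Mpps per accelerated core a full-duplex 10 Gbps port (~29.76 Mpps)
# needs only a handful.
print(round(max_pps(1) / 1e6, 2))    # 2.98
print(round(max_pps(10) / 1e6, 2))   # 29.76
```

Dividing these wire rates by a vSwitch’s measured per-core pps is the quickest sanity check on any published core-count claim.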
Hewlett Packard Enterprise recently published test results comparing bare metal with SR-IOV and an accelerated vSwitch implementation, this time from 6WIND. [7] While the CPU tax must still be taken into account, the results show that, once outside the 64-byte packet realm, both start hitting their stride with near-peak performance.
Data plane acceleration testing results and comparisons, courtesy of HPE
While this looks good in a table, we should remember that in typical Internet Mix (IMIX) traffic profiles, 64-byte packets make up 50 to 60 percent of an average traffic sample. That proportion climbs even higher (if you can climb dramatically from 60 percent) if the VNFs in question happen to be handling real-time voice and video traffic… like a Session Border Controller, for example. This is exactly why we took our SBC out for a spin on both the SR-IOV and DPDK/OVS test tracks. As Metaswitch CTO Martin Taylor documented in a blog post back in January 2015, our results put many minds at ease. [8] Having read this far, you won’t be surprised to hear that SR-IOV delivered performance only a few percentage points below that of our industry-leading bare metal SBC implementation. Most gratifying, however, was the fact that a modified DPDK OVS delivered 90 percent of the performance seen with SR-IOV. Even with the additional CPU tax levied by such accelerated vSwitch implementations, that is a small pill to swallow given the deployment flexibility they facilitate.
If all this makes it look like SR-IOV and DPDK, in general, are mutually exclusive, then I apologize. You can absolutely employ SR-IOV to write data into a VM hosting a VNF that is also using DPDK. As mentioned previously, you can also throw a DPDK-accelerated vSwitch into the mix within a common host machine and even use both in unison (with the right SR-IOV-targeted VNF, plus an NFV infrastructure manager suitably equipped to handle such complexities) if it makes sense to do so from a deployment standpoint. Whether it does remains to be seen.
So, thanks to the advice of Mr. Colton, what did I learn? Or, more specifically, what did I have to "unlearn"? Well, I learned that SR-IOV prevents not one but two interrupts and a huge bottleneck. Plus I discovered how damaging inter-process communication calls really are to a vSwitch and that DPDK takes them out of the equation. All in all, it certainly made me think.
Learn more about NFV data plane acceleration in this post: FD.io Takes Over VPP and Unites with DPDK to Accelerate NFV Data Planes to Outright Nutty Speeds.
[1] Rev 1.1 was introduced in January 2010.
[2] iWARP and later RoCE brought true RDMA to Ethernet, in answer to the threat of InfiniBand.
[3] The Physical Function (PF -- the Ethernet port itself) continues to include a complete PCIe implementation.
[4] These are not physical memory locations, as the VM is unaware of those; Intel Virtualization Technology for Directed I/O (VT-d) is required to map between the virtual address space and the physical one.
[5] Andy Hill and Joel Preas from Rackspace at the OpenStack Paris Summit in November 2014.
Simon is the Director of Technical Marketing and a man of few words.