The Myth of the Carrier-Grade Cloud

Many of the network operators we talk to about virtualized network functions (VNFs) are investing heavily in building specially engineered clouds for network functions virtualization (NFV). Others are looking closely at some of the commercial “carrier-grade” cloud solutions on the market. This cautious shopping for the perfect telco-like cloud seems to be one of the key reasons that NFV is taking so long to get into production.

Our view has always been that NFV should, as far as possible, take advantage of standard IT-grade cloud technologies.  For sure, NFV imposes some specific requirements that can only be addressed by technologies not commonly seen in IT-centric data centers.  These requirements are mostly in the area of network performance, and over the last couple of years, OpenStack has advanced in leaps and bounds to meet that need, for example by embracing Single Root I/O Virtualization (SR-IOV). We think that standard OpenStack distros from the likes of Red Hat, Mirantis and Canonical are now more than ready for production deployments of NFV, but many of our customers don’t agree with this position.

We’ve been trying to get a better understanding of their problems with standard OpenStack distros, and it appears that many of their issues are associated with achieving high availability (HA).  The problem seems to be that most of the VNFs being offered by their traditional vendors are simple ports of software originally designed to run on hardware appliances, and for a number of reasons, these VNFs are not able to deliver a highly available service on a standard OpenStack cloud.

I was initially a bit mystified by this.  As a software vendor, our approach has always been to place no trust in the availability of any part of the system on which our software runs.  Failure can (and will) occur at any level of the stack, from hardware through OS to our own application code.  And in a cloud environment, where the stack includes a hypervisor, a software-defined network and a cloud control plane, these are all points of failure we need to worry about too.  In our view, the application needs to be designed in such a way that it takes full responsibility for ensuring that the service is always available, regardless of any failure at any level in the stack that supports it, or any failure in the application code itself.  This means that the application must be able to detect any event, at any layer in the stack, that impacts service availability, and then recover automatically.

This approach to high-availability design requires nothing special from the cloud.  From our point of view, the cloud is just one more layer in the stack that can fail at any time, and our application needs to be able to cope with that.  Needless to say, when deployed as per our guidelines on standard off-the-shelf OpenStack distros, all our VNFs are capable of delivering better than five-nines availability.
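
To make the idea of application-level detection and recovery concrete, here is a minimal sketch in Python.  It is an illustration only and not a description of how any particular VNF is built: the `Cluster` and `Instance` classes, the heartbeat mechanism and the timeout value are all assumptions introduced for this example.  The point it shows is that the supervisor never asks the cloud why an instance went quiet; silence for longer than the timeout is treated as failure, whichever layer of the stack caused it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    """One running copy of the application (hypothetical, for illustration)."""
    name: str
    last_heartbeat: float  # time of the most recent heartbeat, in seconds


@dataclass
class Cluster:
    """Hypothetical application-level HA supervisor; not any vendor's API."""
    timeout: float  # seconds of silence after which an instance is presumed failed
    instances: Dict[str, Instance] = field(default_factory=dict)
    active: Optional[str] = None  # name of the instance currently serving traffic

    def heartbeat(self, name: str, now: float) -> None:
        """Record a heartbeat; the first instance seen becomes active."""
        if name in self.instances:
            self.instances[name].last_heartbeat = now
        else:
            self.instances[name] = Instance(name, now)
        if self.active is None:
            self.active = name

    def healthy(self, now: float) -> List[str]:
        """Names of instances heard from within the timeout, sorted for determinism."""
        return sorted(
            name
            for name, inst in self.instances.items()
            if now - inst.last_heartbeat <= self.timeout
        )

    def check(self, now: float) -> Optional[str]:
        """Fail over if the active instance has gone silent.

        The cause of the silence (hardware, hypervisor, network, cloud
        control plane or an application bug) is deliberately not examined.
        Returns the active instance name, or None if nothing is healthy.
        """
        alive = self.healthy(now)
        if self.active not in alive:
            self.active = alive[0] if alive else None
        return self.active
```

In use, each instance would call `heartbeat()` periodically and a supervisor loop would call `check()` with the current time.  Time is passed in explicitly rather than read from the system clock so that the logic can be exercised in tests.  A real design would also need state replication and protection against split-brain, which this sketch leaves out.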

Why do VNFs from other vendors apparently require some special engineering in the cloud to deliver high availability? I think the answer is that many of these VNFs started life as software loads running on proprietary hardware appliances, and their approach to high availability has been shaped by that environment.

Historically, many telco appliances have embodied an architecture that set out to deliver high availability from some combination of specialized hardware, embedded operating system software and HA middleware.  The OS, middleware and hardware took care of state replication, fault detection and failover so that the application software didn’t have to.  This, of course, made it much easier to write the application software, which was immensely appealing to hardware vendors.

The problem is that this model doesn’t translate very well to the cloud.  The cloud assumes that the application is going to run on some standard operating system image (typically Linux), not some specialized embedded OS.  Proprietary HA middleware layers don’t sit well with hypervisors.  And hardware-specific functions supporting HA are simply not available to either the middleware or the application software.

Many network operators appear to have been convinced by their traditional vendors that they need a "carrier-grade" cloud for NFV.  I don’t really know what that means, but I suspect it’s a smokescreen for the fact that VNFs from these vendors can’t deliver HA without some special sauce in the cloud.  And of course, the longer it takes network operators to put that "carrier-grade" cloud in place to support NFV, the longer they will have to go on purchasing traditional hardware appliances from those same vendors.

Here is my advice to network operators:

  1. Start deploying NFV now using standard off-the-shelf OpenStack together with VNFs that can meet your service availability requirements when deployed on this type of cloud environment.  That way, you can start enjoying the benefits of NFV without waiting any longer than necessary.
  2. Put pressure on your vendors to deliver VNFs that don’t require anything special from the cloud to support high availability.  If they can’t do this, look for other vendors who can.
  3. Don’t listen to anyone who tells you that you need a "carrier-grade" cloud.  Netflix, WhatsApp, Facebook, Twitter and thousands of other Web-scale services deliver carrier-class service availability without the crutch of a "carrier-grade" cloud. Most of them, in fact, run on mass-market public cloud environments.

We think most network operators actually do understand that VNFs should deliver high availability without needing anything special from the cloud.  By departing from the mainstream of open source cloud software evolution to compensate for poorly engineered VNFs, network operators will, we think, be building a long-term problem for themselves.  This problem will manifest either as vendor lock-in or as a perpetual need to apply a ton of patches to each successive release of OpenStack.

By far the safer option, in our view, is to stick with “standard” OpenStack, start deploying VNFs from network software providers (NSPs) that demonstrate they can deliver a carrier-grade service in that environment, and continue to push other vendors hard to deliver VNFs that are designed the way they should be: with full responsibility for service availability built in.
