How Stateless Processing Enables Massive Scalability

Stateless processing is a prominent design feature of cloud native network function (CNF) architecture. But what is it and why is it so important for telcos? In short, it enables massive scalability and is inherently fault tolerant. 


The following is excerpted from our recent white paper, Cloud Native Network Functions: Design, Architecture and Technology Landscape, and it best explains how stateless processing works and why it matters for telcos.

The concept of stateless processing can be described as follows: A transaction processing system is divided into two tiers. One tier comprises a variable number of identical transaction processing elements that do not store any long-lasting state. The other tier comprises a scalable storage system based on a variable number of elements that store state information securely and redundantly. The transaction processing elements read relevant state information from the state store as required to process any given transaction, and if any state information is updated in the course of processing that transaction, they write the updated state back to the store.

It’s probably not obvious from reading the description above how this approach enables massive scalability. So, let’s use a practical example to illustrate.

Suppose we are developing an e-commerce application. The application needs to support a number of HTTP transaction types including login to account, add item to shopping basket, review shopping basket, checkout etc. The application code that processes these transactions needs access to certain information (i.e. “state”), for example, details of the user’s account and the current contents of the shopping basket. In a traditional application architecture, this state would be kept in the application’s local storage.

The first problem that we need to solve is how to provide fault tolerance. If a server dies, then any local state that is stored in it is lost. The physical servers that are deployed in cloud environments are not particularly reliable, and failures are fairly frequent. Users get pretty upset if they’ve spent 30 minutes online grocery shopping, and their shopping basket suddenly disappears. We face a difficult choice here: either we accept the risk that a small proportion of e-commerce sessions will fail due to equipment failure, or we have to deploy a second server to act as a backup, and maintain a shadow copy of all the state on it – which doubles the amount of hardware resources the application is consuming.

Now suppose that we need this application to support millions of concurrent online shopping sessions. A single server (or active-standby pair of servers) is not going to be able to handle the load, so we need to deploy a number of servers. The problem that we now need to solve is that each incoming HTTP request needs to be directed to the correct server, the one that knows about this particular user and session.

We therefore need to deploy something like a load-balancer in front of our collection of servers, and the load-balancer needs to be able to identify the user and session from the information in each incoming request, remember which server is handling each user session, and redirect each request to the correct server. The load-balancer is therefore quite a complex application in its own right. And because it’s potentially a single point of failure, it needs to be fault tolerant, which makes it even more complex.

But the biggest single issue here is that the performance and capacity of the load-balancer puts an upper limit on the transaction processing load that we can handle. What happens if our e-commerce site is wildly successful and we cannot obtain a load-balancer that is powerful enough to handle all of the demand?

With the stateless processing approach, we implement the elements that process HTTP transactions without any local state storage and have them read and write state to and from a separate storage system. When an HTTP request arrives at one of these elements, it extracts some information from the request that uniquely identifies the session (for example, from a cookie), and then uses this information to retrieve the current state associated with this session (user account details, contents of shopping basket) from the state store. If the transaction has the effect of changing any of this state, for example, because the user added an item to her shopping basket, then the transaction processing element writes the updated state back to the state store.

The difference now is that any incoming HTTP request can be handled by any arbitrary instance of the transaction processing element. We do not have to steer each request to the instance that “knows” about it, because knowledge about each session is available to every processing element instance from the state store. We still need some way to balance the load of incoming requests across the population of transaction processing elements, but we can do this without having to deploy a load-balancer, for example, by leveraging DNS to perform dumb round-robin load balancing. By eliminating the load-balancer, we’ve eliminated the limiting factor on scale. We also don’t need to worry about any individual transaction processing element failing. Such failures do not result in the loss of any state, because all the state is stored separately.

The stateless approach is therefore inherently fault tolerant. If any processing element instance dies or becomes unresponsive, then the built-in re-try mechanisms of HTTP will result in subsequent attempts being handled by another instance. So long as we have a modest amount of performance headroom in our population of processing elements, the failure of any one of them has no impact on the service: the load that it would otherwise have handled is simply re-distributed across the remaining instances. We can very easily extend this fault tolerance mechanism across multiple data centers, so that even the loss of an entire data center will not bring down our service.

Individual processing elements can be quite small in scale: we can keep the architecture of these elements simple by not worrying about trying to make them very powerful, for example, with support for lots of multi-core parallelism. We handle scaling by deploying as many processing element instances as we need to handle the load, an approach which is known as “scale out” (in contrast to “scale up”, which involves deploying bigger server instances). We can also change the number of processing elements on the fly (scaling both out and in) in response to changing load – enabling us to make the most efficient use of compute resources at all times.

All of this depends, of course, on our ability to build and deploy a highly scalable and very fault-tolerant storage system in which to keep all of our application state. Because this is an absolutely fundamental requirement of the stateless processing design pattern, there has been a lot of investment in this area, particularly by the main Web-scale players.

Many of the solutions that they have built to address this need are available as open source. For example, one of the leading distributed state stores, Apache Cassandra, was originally developed at Facebook, and is now used by Netflix, Twitter, Instagram and Webex among many others. Test results published by Netflix show Cassandra performance scaling linearly with number of nodes up to 300, and handling over a million writes per second with 3-way redundancy – more than enough to handle the needs of most telco-style services even with many hundreds of millions of subscribers. Cassandra includes support for efficient state replication between geographically separate locations, and therefore provides an excellent basis for extremely resilient geo-redundant services.

It’s perhaps worth pointing out that stateless processing is by no means the only design pattern seen in cloud native applications, although it’s definitely the most prominent. Other design patterns worth mentioning include stream processing (based on frameworks such as Heron or Storm) and serverless processing, best exemplified by Amazon Lambda. These have only emerged relatively recently, but they definitely have potential to advance the state of the art in CNF design.

For a deeper dive into cloud native architecture, please download the white paper.