Composable Networking in Massively Scalable Data Centers

Like many missives that have come before, this blog post begins with some blatant plagiarism. When talking network disaggregation, Russ White -- currently serving as a network architect at LinkedIn -- tops and tails his presentation with a picture of chocolate cookie ingredients to make the point that everyone should disaggregate. I thought it was a great analogy. I like cookies. Who doesn’t? But while I can follow instructions to make a batch of cookies in the same way I can follow a MOP and deploy a disaggregated network element, it did frighten me a little. Indeed, my initial thought was “... yeah, but I’d just rather buy a cookie than bake one from scratch.” Then I realized that, in fact, I don’t.

I’m a salty sorta guy, so I never really ate a ton of sweets. Then I started getting older and thought I’d better take care of myself a little more than I had in the past. I practice being vegan -- with practice being the operative word, as it’s still a work in progress. I prefer dark chocolate, anyway, whereas most cookies use milk chocolate. My missus has a gluten sensitivity, so she would never help finish up any leftovers. The kids have a thing about cinnamon, which is fun around Thanksgiving, and friends nag me about farm-to-fork because (of the whole vegan thing) they think I care, bless ’em. I never buy cookies. Ever. Every so often, someone hands me a gluten-free, vegan, dark chocolate chip cookie and (contrary to conventional wisdom that the wrapping would be tastier) I actually enjoy it. I should really make them myself.

That’s what composable networking is about: Cookies. No, wait -- that’s not right. It’s about choice. Getting exactly what you want. Ensuring a network comprises just the components it requires -- only the applications you actually want and nothing that could put it at risk. If an infrastructure will operate better with dark chocolate chips, then it should get them -- not whatever Nabisco thinks it should have. If flour is going to upset its stomach, then we should use an alternative. If there’s a chance it will result in us reaching for an EpiPen, then why risk it? Let’s make sure there are no nuts in the network. Apart from the ones who take the night shift in the NOC.

Data Center Infrastructure Fabrics

While we might think this ability to granularly compose network switches and routers is only applicable to carrier infrastructure, where architecture and applications vary wildly, it’s also an important attribute for devices being deployed within massively scalable data centers (MSDC). In the battle of the data center, Layer 3 routed fabrics are becoming the dominant architectural paradigm.  

Therein lies a problem, however, because while they are dynamic and scalable, today’s routing protocols have some fundamental (but not insurmountable) problems when deployed in data center architectures. In all fairness -- if indeed the world’s most widely deployed route discovery mechanisms need me to come to their defense -- neither link state (shortest path first) nor distance (path) vector protocols were designed for today’s data center architectures. OSPF, IS-IS, BGP, and the like were built to support any network topology, with varied, practically random node distribution and interconnects. While thoroughly modern in every other respect, however, even the most innovative hyperscale data centers typically employ a distinctly unmodern switching arrangement.

A research engineer at Bell Labs, Charles Clos (pronounced “cloh,” in a French accent), introduced the concept of the non-blocking fabric that now bears his name in an October 1952 manuscript titled “A Study of Non-Blocking Switching Networks.” First published in the March 1953 edition of The Bell System Technical Journal, the paper described a method for designing non-blocking switching fabrics that don’t suffer the inefficiency of n-squared laws, where the number of internal crosspoints must equal the number of inputs multiplied by the number of outputs. In a three-stage array, this is achieved by connecting an equal number of input and output switches via identically sized intermediary switches over links of a common size: a piece of wire back then, but bandwidth in today’s parlance.
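
To put rough numbers on that, here’s a minimal back-of-the-envelope sketch in Python (the port counts are illustrative choices of mine, not figures from Clos’ paper) comparing a single N x N crossbar against a symmetric three-stage Clos array built with m = 2n - 1 middle-stage switches, the strictly non-blocking condition from the paper.

```python
# Back-of-the-envelope comparison: N x N crossbar vs. three-stage Clos.
# The values of n (ports per edge switch) and r (edge switches) below are
# arbitrary example numbers.

def crossbar_crosspoints(N: int) -> int:
    """A single crossbar needs one crosspoint per input/output pair."""
    return N * N

def clos_crosspoints(n: int, r: int) -> int:
    """Symmetric three-stage Clos: r ingress and r egress switches of n
    ports each, joined by m = 2n - 1 middle switches (strictly non-blocking)."""
    m = 2 * n - 1
    ingress = r * (n * m)   # r switches, each an n x m crossbar
    middle = m * (r * r)    # m switches, each an r x r crossbar
    egress = r * (m * n)    # mirror image of the ingress stage
    return ingress + middle + egress

if __name__ == "__main__":
    n, r = 32, 32           # N = n * r = 1,024 inputs and outputs
    N = n * r
    print(f"Crossbar: {crossbar_crosspoints(N):,} crosspoints")
    print(f"Clos:     {clos_crosspoints(n, r):,} crosspoints")
```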

Over half a century later, Clos topologies are admirably serving the east-west traffic needs of high-tech data centers. In data center phraseology, the input/output switches are called Leaf and the aggregation switches, Spine. As in the classic Clos model, neither Spine switches nor Leaf switches are typically directly connected to each other. Because today’s switches are bidirectional, the input stages are the same as the output stages, so Clos implementations within data centers don’t visually reflect the three-stage design outlined in that original research paper; the resulting topological view is commonly referred to as a Folded Clos. Packets from a server enter through its associated Leaf switch, cross one of the Spine switches, and are then forwarded to the receiving server’s Leaf switch. Equal Cost Multipath (ECMP) routing is used to load-balance traffic across the Spine.
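
To illustrate that last point, here’s a toy Python sketch of flow-based ECMP: hash a packet’s five-tuple and use the result to pick one of the available Spine uplinks. Real switch ASICs use their own hash functions and field selections, so treat this purely as the general idea.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, spine_uplinks):
    """Pick a Spine uplink for a flow by hashing its five-tuple.

    Hashing (rather than round-robin) keeps every packet of a flow on the
    same path, preserving packet order while spreading flows across links.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(spine_uplinks)
    return spine_uplinks[index]

# Example: a Leaf with four equal-cost uplinks toward the Spine layer.
uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop("10.0.1.5", "10.0.9.7", 49152, 443, "tcp", uplinks))
```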

Now, while it’s often assumed that data centers implement pure Clos topologies, they do not. As you can see, Clos demands that each switch in one layer (i.e., Leaf) is connected to every switch in the other (Spine), requiring an equal number of switches across both. As you can imagine, this quickly becomes prohibitively expensive. For this reason, data center architects actually employ a variation of Clos called the Fat Tree.

A derivative of classic tree topologies, this is still not a new architectural approach or terminology, of course. Envisioned by the computer scientist Charles Leiserson and published in an October 1985 IEEE journal, Fat Trees were originally conceived for interconnecting processors in parallel supercomputers. The major difference from a pure Clos is that a Fat Tree neither requires full interconnection nor depends on an equal number of switches at each layer. Fat Tree topologies can have a variable number of hierarchical aggregation layers. Once again, because of the bidirectional nature of today’s switches, our Fat Tree is still technically folded, but as it actually looks tree-like in structure, that prefix is not typically used.
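
To get a feel for the scaling, here’s a small Python sketch of the sizing arithmetic for one widely used data center variant: a three-tier Fat Tree built entirely from identical k-port switches, which supports k^3/4 servers. The 48-port example below is just that -- an example.

```python
def fat_tree_capacity(k: int) -> dict:
    """Sizing for a three-tier Fat Tree built from identical k-port switches.

    The fabric is organized into k pods, each with k/2 edge (Leaf/ToR) and
    k/2 aggregation switches, topped by (k/2)**2 core (Spine) switches.
    """
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "servers": (k ** 3) // 4,
    }

# Example: 48-port switches yield a fabric for 27,648 servers.
print(fat_tree_capacity(48))
```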

Emerging Routing Methodologies for MSDCs

The question of which (imperfect) routing protocol to employ in these Fat Tree topologies is not as easy to answer as you might imagine. The Border Gateway Protocol (BGP) is the obvious response. After all, we have been suggesting the use of BGP for everything for the last 20 years, right? You would not be entirely wrong with that answer… but not completely correct either. Because BGP is a path (distance) vector protocol, a single link failure can trigger a large number of prefixes to be withdrawn, making reconvergence times lengthy. Link state interior gateway protocols (IGPs) like OSPF and IS-IS, which use Dijkstra’s shortest path first (SPF) algorithm from circa 1956, are a little more complex but fast, performing their calculations at the same time the routes are being propagated. This is achieved, however, at the expense of every node maintaining an entire copy of the network topology, built using rather inefficient broadcast (flooding) techniques.
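
For reference, the SPF calculation at the heart of those link state IGPs is simply Dijkstra’s algorithm run over the flooded topology database. The minimal sketch below (with a made-up four-node topology) shows the computation every router repeats whenever its link state database changes.

```python
import heapq

def spf(topology, source):
    """Dijkstra's shortest path first over a link state database.

    `topology` maps each node to {neighbor: link_cost}; every router runs
    this same calculation against its own copy of the flooded database.
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]
    while heap:
        cost, node, via = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue                      # stale entry, already improved
        if via is not None:
            first_hop.setdefault(node, via)
        for neighbor, link_cost in topology.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor,
                                      via if via is not None else neighbor))
    return dist, first_hop

# Illustrative four-node Leaf/Spine topology (names and costs are made up).
lsdb = {
    "leaf1": {"spine1": 1, "spine2": 1},
    "spine1": {"leaf1": 1, "leaf2": 1},
    "spine2": {"leaf1": 1, "leaf2": 1},
    "leaf2": {"spine1": 1, "spine2": 1},
}
print(spf(lsdb, "leaf1"))
```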

Some data center architects choose to employ an IGP, like IS-IS with modifications such as RFC 7356, or even iBGP over an IGP, as in classic routed networks. The added complexity and the resource inefficiencies presented by the broadcast nature of their operation, however, have resulted in exterior BGP (eBGP) being favored in Layer 3 MSDC fabrics. But not without some concessions. RFCs such as 7938 (informational) outline modifications to the standard protocol implementation -- or relax some of the typical prerequisites -- so that it can better serve Folded Clos/Fat Tree MSDC topologies. This includes working around limitations such as autonomous system (AS) numbering.
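
As a purely illustrative example of one such concession, here’s a toy Python sketch of an ASN plan in the spirit of RFC 7938: a shared ASN for the Spine tier and a unique private ASN per Leaf/ToR, drawn from the 4-byte private range so the handful of 2-byte private ASNs doesn’t run out. The base value and device names are my own arbitrary choices, not a recommendation.

```python
# Toy eBGP ASN plan in the spirit of RFC 7938: one shared ASN for the Spine
# tier and a unique private ASN per Leaf/ToR switch, drawn from the 4-byte
# private range (4200000000-4294967294). Base value and names are examples.

PRIVATE_4BYTE_BASE = 4_200_000_000

def build_asn_plan(spine_count: int, leaf_count: int) -> dict:
    plan = {}
    spine_asn = PRIVATE_4BYTE_BASE              # all Spines share one ASN
    for i in range(1, spine_count + 1):
        plan[f"spine-{i}"] = spine_asn
    for i in range(1, leaf_count + 1):          # each Leaf gets its own ASN
        plan[f"leaf-{i}"] = PRIVATE_4BYTE_BASE + i
    return plan

print(build_asn_plan(spine_count=4, leaf_count=8))
```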

Both the IGP and BGP alternatives suffer from some significant -- and common -- issues. These include a propensity toward route black holes, plus the possibility of excessive propagation of network state changes, amplified by the large number of densely packed links in such MSDC implementations. With classic routing propositions, there is also an excessive amount of link state information held at every stage, including the stubby-ish, bottom-layer Leaf/ToR switches that should be low-cost devices. Finally, if we know the topology, then why not have the protocols use that knowledge to ensure there are no violations?

Wanting to do better than simply shoehorn eBGP into a very hierarchical and symmetrical Fat Tree, two working groups have recently been formed within the IETF to define a better approach to implementing routing within data centers with well-defined topologies. In no particular order (because there is no favoritism here), one initiative is called Routing in Fat Trees (RIFT) and the other, Link State Vector Routing (LSVR). Both working groups have similar aims: defining a hybrid approach to Layer 3 routing for data centers. Why, then, are there two? Probably simply because one is being driven by Juniper and the other by Cisco, but, again, I am staying out of the politics. Either way, both claim to make strides toward permanently fixing the problems that current protocol implementations and hacks present for data center routing.

The goal is to employ the best attributes of link state and distance vector protocols while eliminating some of the negatives. Fundamentally, this means minimizing route state, converging quickly and detecting topology dynamically, while reducing flooding and the wide propagation of updates.

A great example of this hybrid approach, RIFT employs link state protocol techniques from the lower layers (ToR) toward the upper layers (Core/Spine) and distance vector behavior in the opposite direction. In effect, it floods state information south to north while, north to south, each stage advertises only the routes actually required, which in most cases is just a default route, so that is all that gets installed and propagated.


A simplified view of Routing in Fat Trees (RIFT)

The lower-level (ToR) switches see only their neighbors and have no visibility into the rest of the network, while the top-level (Spine) switches have a complete view of the network topology. The result is fast and efficient convergence, minimizing the impact of outages or link bounces and the chance of routing black holes. Obviously, this is a very simplistic view. There are other advantages, but also still much work to do, as these groups have only recently formally convened.
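
For the sake of illustration, here’s a toy Python model of that direction-dependence. The node levels, names and prefixes are invented, and it glosses over most of RIFT’s machinery (such as disaggregating specific prefixes when a southbound link fails), but it captures the asymmetry.

```python
# Toy model of RIFT's direction-dependent advertisements. Levels increase
# toward the top of the fabric; names and prefixes are invented.

def advertise(node_level: int, neighbor_level: int, local_prefixes: list):
    """Return what a node offers a neighbor, based on relative level.

    Northbound (toward the Spine): flood full, link-state-style detail.
    Southbound (toward the ToRs): behave like distance vector and, in the
    common case, hand down only a default route.
    """
    if neighbor_level > node_level:          # talking northbound
        return {"type": "link-state", "prefixes": local_prefixes}
    else:                                    # talking southbound
        return {"type": "distance-vector", "prefixes": ["0.0.0.0/0"]}

# A ToR (level 0) tells its Spine (level 1) everything it owns...
print(advertise(0, 1, ["10.1.1.0/24", "10.1.2.0/24"]))
# ...while the Spine tells the ToR only that "everything else is up here".
print(advertise(1, 0, ["10.0.0.0/8"]))
```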

Naturally, it’s not as simple as I’m suggesting here… because it never is. Along with the two new IETF Working Groups I’ve outlined above, there are also two stand-alone draft contenders coming out of the Network Working Group. And while I’ve been harping on about the virtues of distance vector, both of them look to modify the IS-IS link state protocol for efficient data center routing. One is backed by Cisco (IS-IS Routing for Spine-Leaf Topology), while the other (OpenFabric) has LinkedIn behind it, so both are strong contenders.

This uncertainty is why the adoption of composable networking philosophies is critical. The beauty of composable networking is that you only get what you need. What you want and like. The good stuff. Nothing that can cause you any discomfort or harm. You get to choose what’s best for your infrastructure in order for it to operate at its optimum. And if your taste changes, then you can switch ingredients any time.

Now, a lot of people are going to continue to grab a prepackaged (switch/router) cookie off the shelf and I respect that. In this foodie culture, however, I think that more and more network architects and engineers will be building theirs from scratch. Just please don’t be sending me pictures of yours.

For more information about our portfolio of Composable Networking Protocols, please click here.