Stack Wars and the Rise of the SDN Republic

Recently there is much insanity formed around the “SDN disruption” and the new “Stack Wars” its time to sit back and look what is going on.

The “Stack Wars” are in full swing ensnaring AMZN, VMW, MSFT, and  GOOG. With the recent aquisition of Nicira by VMWare the “Platform Wars” have exploded into an all out fight for the entire stack.  VMWare continues its dominance according to one survey suggesting a 24% lead over the next largest competitor OpenStack.  But VMWare  has yet to differentiate beyond the enterprise.

VMWare recently fired another shot from their Death Star publishing a new open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services. Cloud Foundry BOSH opens up the world of poly-cloud services. According to  Steve Herrod latest post:

Cloud Foundry’s goal is to be the “Linux of the cloud.” Just as Linux provides a high degree of application portability across different hardware, Cloud Foundry provides a high level of application portability across different clouds and different cloud infrastructure. Steve Herrod, CTO VMware

So what about the EMC created and Maritz lead Project Zephyr, Both Tucci and Maritz are  tuned into the expanding market for insfrastructure and platform services expected to grow a combined $26.5B by 2016 , they must start to build a reputation outside of the Enterprise and go after the same Consumer IT market Amazon has been so successful in capturing.

OpenStack and have yet to prove their scalability and operational robustness under fire (although RACK is desperate to make Essex a success). Others are following suit recently RedHat finally pledged to the OpenStack initiative but there are still major issues in governance and fragile source based which I feel still make it a questionable platform to build your business on.

We must not forget about the >1M server Google and the massively scalable Amazon  coming downstream from Consumer IT  into the Enterprise (AMZN certainly has been going after the enterprise but not as a unification of public/private resource pools). If Larry Page and Jeff Bezos wise up they will start to offer their orchestration and management tools to use within enterprises and expand into poly-cloud control. This can benefit their bottom line  with a simple agreement of guaranteed public cloud usage which can easily be justified based on todays cloud sprawl. Having seamless access to secured and QOS aware Enterprise along with the scalability and platform richness of public clouds will shift the power to one of these heavyweights who might complete the “Death Star” and capture the “Linux of the Cloud” trophy.

SDN Disruption

In our core domain, there is a significant amount of confusion about what problems need to be solved and where. For instance having a rich set of API’s to manage infrastructure is simply a matter of economies of scale. Without lowering the average cost per unit (in terms of operations, robustness, flexibility, etc..) by means of automation, you are simply carrying an anchor around all day slowing your business. But is this SDN?

VMWare has moved into the world of SDN through the acquisition of Nicira.  VMware has been successful at virtualizing compute, storage and now networking. This can be considered the trifecta necessary to capture the ” control points” enabling them to be first in developing a unified Abstract Binary Interface to all infrastructure components. Those of you familiar with the Linux Standard Base or Single Unix Specification would recognize why this is extremely valuable in building the cloud operating system.

Each of their control points provide added value to build upon for instance the VM association to location, policy, metrics, QOS guarantees, etc.. These are incredibly valuable as is the network binding (mac->port->IP). With this information the owner can control any resource anywhere regardless of the network, operating system or the hypervisor..

Value of the Physical Network

So what about the role of the physical network.. We have heard many leaders discuss this in the context of commoditized switches, merchant silicon and proprietary fabrics.

There are significant challenges in optimizing networks especially data centers which require a mixed set of services and tenants such as  Unified Multi-Service Data Centers. There is a need for efficient topologies which maximize bi-sectional bandwidth, reduce the overhead in cabling and reduce operational complexity. The network fabric should work as well for benign traffic as it does for permutation traffic (i.e. many-to-one scenarios familiar to partition aggregate application patterns).

If I can’t utilize the full capacity of the network and be assured that I have properly scheduled workloads during permutation traffic interactions than certainly the physical network becomes an increasingly important design point. This requires changes to topology, flow control, routing and possibly the protocol architecture in order to arbitrate amongst the competing flows while maintaining low variance in delay and robustness to failures.

Realistically, the shift to 10GE network fabrics and host ports provide better scalability and to date application designers have yet to fully exploit distributed processing which means the data center traffic matrix is still fairly sparse.

As we move into the future, and workloads become more dense, one could argue that the physical network has a lot more it can accomplish. For instance ALL Fat-Tree architectures limit the available capacity of the network to the min-cut bi-sectional bandwidth. This means that overall throughput is limited to 50% of total capacity (Note: that is an ideal throughput because routing and flow control limit capacity further). The question for data center designers is will you pay for a network which they can only utilize a subset of 50% of the capacity they purchased? Well I certainly would be looking for options that would improve my cost model and this is an area which we haven’t yet found the secret sauce..

The reality is the way we architect networks today are far more efficient and offer more capacity than ever before. Load dependant bottlenecks show up way before you can exhaust the resources of the network which basically support the argument of network virtualization to reduce the amount of churn (i.e. state management) in the physical network allowing it to be more robust and reliable and predictable.

SDN to the Rescue?

The main problem today is the exhaustive manual effort in configuring all of the dependancies, dials, protocols and having to think about how things physically lay down together from the wiring to the VLAN associations, security policies etc.. This has become too cumbersome and impossible to reason about which is why overlays look so attractive. You no longer can codify or teach the network on a whiteboard, even representing all the different configuration noise on Visio’s are extremely complex and you still can’t reason holistically about network continuity, security and access control.

Reasoning about the network and applying formal verification testing before changes will allow networks to be much more predictable with much less complexity and failures. Todays switches and routers which require knowledge of complex data structures, different algorithmic complexities and interrelated dependancies cause a chain reaction of combinatorial issues. Between the link-layer and inter-domain routing there are many interactions which can go haywire and current techniques like static-analysis don’t cover the quadratic state explosion problem which exists in todays infrastructure software.

As far as SDN, its only a matter of time before the TCAM manufactures catch up to the requirements being forged in the ONF. Nick McKeown made a point in his SIGCOMM 2012 keynote that in a few short years we will power efficient TCAM’s with 100s of thousands of entries and multiple table support. Given that this is the primary bottleneck to complete the SDN ABI we will most likely see SDN become a very strong alternative to todays mix bag of control plane protocols. To be honest, rightly so.. This is not necessarily our fault but an artifact of the flawed protocol model developed at a  time where getting a character across the screen on a terminal was considered a huge step forward. This is most certainly not the world we live in anymore and unfortunately the specializations which have been built up to deal with this model are quickly being challenged..