Stack Wars and the Rise of the SDN Republic

Recently there has been much insanity around the “SDN disruption” and the new “Stack Wars”; it’s time to sit back and look at what is going on.

The “Stack Wars” are in full swing, ensnaring AMZN, VMW, MSFT, and GOOG. With the recent acquisition of Nicira by VMware, the “Platform Wars” have exploded into an all-out fight for the entire stack. VMware continues its dominance, with one survey suggesting a 24% lead over its next largest competitor, OpenStack. But VMware has yet to differentiate beyond the enterprise.

VMware recently fired another shot from their Death Star, publishing a new open source tool chain for release engineering, deployment and lifecycle management of large-scale distributed services. Cloud Foundry BOSH opens up the world of poly-cloud services. According to Steve Herrod’s latest post:

Cloud Foundry’s goal is to be the “Linux of the cloud.” Just as Linux provides a high degree of application portability across different hardware, Cloud Foundry provides a high level of application portability across different clouds and different cloud infrastructure. Steve Herrod, CTO VMware

So what about the EMC-created and Maritz-led Project Zephyr? Both Tucci and Maritz are tuned into the expanding market for infrastructure and platform services, expected to grow to a combined $26.5B by 2016. They must start to build a reputation outside of the Enterprise and go after the same Consumer IT market Amazon has been so successful in capturing.

OpenStack has yet to prove its scalability and operational robustness under fire (although RACK is desperate to make Essex a success). Others are following suit; recently Red Hat finally pledged to the OpenStack initiative, but there are still major issues in governance and a fragile source base which I feel still make it a questionable platform to build your business on.

We must not forget about the >1M-server Google and the massively scalable Amazon coming downstream from Consumer IT into the Enterprise (AMZN certainly has been going after the enterprise, but not as a unification of public/private resource pools). If Larry Page and Jeff Bezos wise up, they will start to offer their orchestration and management tools for use within enterprises and expand into poly-cloud control. This can benefit their bottom line with a simple agreement of guaranteed public cloud usage, which can easily be justified based on today’s cloud sprawl. Having seamless access to a secured and QOS-aware Enterprise along with the scalability and platform richness of public clouds will shift the power to one of these heavyweights, who might complete the “Death Star” and capture the “Linux of the Cloud” trophy.

SDN Disruption

In our core domain, there is a significant amount of confusion about what problems need to be solved and where. For instance, having a rich set of APIs to manage infrastructure is simply a matter of economies of scale. Without lowering the average cost per unit (in terms of operations, robustness, flexibility, etc.) by means of automation, you are simply carrying an anchor around all day, slowing your business. But is this SDN?

VMware has moved into the world of SDN through the acquisition of Nicira. VMware has been successful at virtualizing compute, storage and now networking. This can be considered the trifecta necessary to capture the “control points”, enabling them to be first in developing a unified Application Binary Interface (ABI) to all infrastructure components. Those of you familiar with the Linux Standard Base or the Single UNIX Specification will recognize why this is extremely valuable in building the cloud operating system.

Each of their control points provides added value to build upon: for instance, the VM association to location, policy, metrics, QOS guarantees, etc. These are incredibly valuable, as is the network binding (MAC->port->IP). With this information the owner can control any resource anywhere, regardless of the network, operating system or hypervisor.

Value of the Physical Network

So what about the role of the physical network? We have heard many leaders discuss this in the context of commoditized switches, merchant silicon and proprietary fabrics.

There are significant challenges in optimizing networks, especially data centers which require a mixed set of services and tenants, such as Unified Multi-Service Data Centers. There is a need for efficient topologies which maximize bisection bandwidth, reduce the overhead in cabling and reduce operational complexity. The network fabric should work as well for benign traffic as it does for permutation traffic (i.e. the many-to-one scenarios familiar from partition/aggregate application patterns).

If I can’t utilize the full capacity of the network and be assured that I have properly scheduled workloads during permutation traffic interactions, then certainly the physical network becomes an increasingly important design point. This requires changes to topology, flow control, routing and possibly the protocol architecture in order to arbitrate amongst competing flows while maintaining low variance in delay and robustness to failures.

Realistically, the shift to 10GE network fabrics and host ports provides better scalability, and to date application designers have yet to fully exploit distributed processing, which means the data center traffic matrix is still fairly sparse.

As we move into the future and workloads become more dense, one could argue that the physical network has a lot more it can accomplish. For instance, ALL Fat-Tree architectures limit the available capacity of the network to the min-cut bisection bandwidth. This means that overall throughput is limited to 50% of total capacity (note: that is an ideal throughput, because routing and flow control limit capacity further). The question for data center designers is: will you pay for a network of which you can only utilize 50% of the capacity you purchased? Well, I certainly would be looking for options that would improve my cost model, and this is an area where we haven’t yet found the secret sauce.
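A back-of-the-envelope sketch of that 50% figure for a standard k-ary fat-tree (idealized: it ignores the routing and flow-control losses noted above):

```python
# Sketch of the capacity arithmetic for a standard k-ary fat-tree.
# Idealized model: every link has unit capacity, routing is perfect.

def fat_tree_capacity(k):
    """Return (hosts, bisection_links) for a k-ary fat-tree."""
    hosts = k ** 3 // 4        # standard k-ary fat-tree host count
    bisection = hosts // 2     # min-cut bisection: half the host links cross it
    return hosts, bisection

hosts, bisection = fat_tree_capacity(4)
# 16 hosts inject up to 16 units of traffic, but only 8 units can cross
# the bisection, so cross-cut throughput is bounded by bisection/hosts = 50%.
print(hosts, bisection, bisection / hosts)
```

The same ratio holds for any k, which is the point of the argument: scaling the tree does not change the fraction of purchased capacity usable across the cut.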

The reality is that the way we architect networks today is far more efficient and offers more capacity than ever before. Load-dependent bottlenecks show up well before you can exhaust the resources of the network, which basically supports the argument for network virtualization: reduce the amount of churn (i.e. state management) in the physical network, allowing it to be more robust, reliable and predictable.

SDN to the Rescue?

The main problem today is the exhaustive manual effort in configuring all of the dependencies, dials and protocols, and having to think about how things physically lay down together, from the wiring to the VLAN associations, security policies, etc. This has become too cumbersome and impossible to reason about, which is why overlays look so attractive. You can no longer codify or teach the network on a whiteboard; even representing all the different configuration noise in Visio diagrams is extremely complex, and you still can’t reason holistically about network continuity, security and access control.

Reasoning about the network and applying formal verification before changes will allow networks to be much more predictable, with much less complexity and fewer failures. Today’s switches and routers, which require knowledge of complex data structures, different algorithmic complexities and interrelated dependencies, cause a chain reaction of combinatorial issues. Between the link layer and inter-domain routing there are many interactions which can go haywire, and current techniques like static analysis don’t cover the state explosion problem which exists in today’s infrastructure software.
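To make the verification idea concrete, here is a toy sketch (the topology and node names are hypothetical): exhaustively check which single-link failures partition two hosts. Real verification tools work over far richer models of configuration and protocol state, but the exhaustive flavor is the same:

```python
from itertools import chain

# Hypothetical topology: two hosts (h1, h2) connected through four switches
# with redundant paths. Each link is an undirected edge.
links = {("h1", "s1"), ("s1", "s2"), ("s1", "s3"),
         ("s2", "s4"), ("s3", "s4"), ("s4", "h2")}

def reachable(src, dst, live_links):
    """Depth-first search over the surviving links."""
    adj = {}
    for a, b in live_links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Exhaustively test every single-link failure: which ones cut h1 off from h2?
cuts = [failed for failed in links if not reachable("h1", "h2", links - {failed})]
```

With the redundant spine, only the two single-homed host links turn out to be cuts; checking multi-link failures multiplies the cases combinatorially, which is exactly the state explosion the paragraph above describes.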

As far as SDN goes, it’s only a matter of time before the TCAM manufacturers catch up to the requirements being forged in the ONF. Nick McKeown made the point in his SIGCOMM 2012 keynote that in a few short years we will have power-efficient TCAMs with hundreds of thousands of entries and multiple-table support. Given that this is the primary bottleneck to completing the SDN ABI, we will most likely see SDN become a very strong alternative to today’s mixed bag of control plane protocols. And to be honest, rightly so. This is not necessarily our fault but an artifact of a flawed protocol model developed at a time when getting a character across the screen of a terminal was considered a huge step forward. This is most certainly not the world we live in anymore, and unfortunately the specializations which have been built up to deal with this model are quickly being challenged.
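For readers unfamiliar with what a TCAM actually does, its behavior can be modeled as ternary patterns matched in parallel with a priority tie-break. This toy model (the rules and 4-bit field width are made up for illustration) shows the lookup semantics the hardware implements at line rate:

```python
# Toy model of TCAM-style ternary matching. Patterns use '0', '1', or '*'
# (don't care); the highest-priority matching rule wins.

def tcam_lookup(key, rules):
    """Return the action of the highest-priority rule matching the key bits."""
    best = None
    for pattern, priority, action in rules:
        if all(p in ("*", k) for p, k in zip(pattern, key)):
            if best is None or priority > best[0]:
                best = (priority, action)
    return best[1] if best else "drop"

# Hypothetical 4-bit rules: a specific prefix and a less-specific fallback.
rules = [
    ("10**", 10, "port1"),
    ("1***", 5,  "port2"),
]
print(tcam_lookup("1011", rules))  # -> port1 (more specific rule wins)
print(tcam_lookup("1100", rules))  # -> port2 (falls through to 1***)
print(tcam_lookup("0000", rules))  # -> drop  (no rule matches)
```

The hardware does all pattern comparisons simultaneously, which is why entry count and table depth, not lookup algorithms, are the limiting factors McKeown referred to.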

Network Mechanics to Network Conductors

I get a general sense that the latest incarnation of network evolution (i.e. SDN) is becoming a way of expressing frustration with a complex set of problems which have yet to be solved. One of the things you have to ask yourself as a network professional is: “What do I really understand about the fundamentals of networking, and how do I put that to use in the post-PC, data-hungry world?”

For years the best way to understand networking was to lug out your Network General Sniffer and watch the interaction of messages flowing across the screen. We had basic signals such as connection management; we had a general understanding of the traffic matrix from interrogating the network addresses, which we compared to our spreadsheets and some heuristics about flows. We leveraged the emerging SNMP standard to collect traffic statistics into our pre-RRD datastores and presented pretty graphs of utilization to understand demand. Soon we had expert systems which would track the protocol state transpiring between hosts and interpret the results.

Scaling the data center meant learning about aggregation/distribution and the ratio of local traffic to remote. At the time most network engineers were taught the 80/20 rule: 80% of the traffic stays local and only 20% is remote. This was a direct consequence of our centralized compute models, mainframes and the fact that most people were still using terminal-based computing and sneakernet. It became the foundation of network design, which reflected this by oversubscribing capacity higher in the tree (i.e. the Core/Distribution/Edge design).
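The oversubscription arithmetic behind that design is simple; a sketch with illustrative port counts:

```python
# Oversubscription at one tier of a Core/Distribution/Edge design:
# the ratio of downstream (host-facing) capacity to upstream (uplink) capacity.

def oversubscription(edge_ports, edge_speed_gbps, uplinks, uplink_speed_gbps):
    return (edge_ports * edge_speed_gbps) / (uplinks * uplink_speed_gbps)

# Illustrative numbers: 48 x 1G edge ports with 2 x 10G uplinks.
ratio = oversubscription(48, 1, 2, 10)
print(ratio)  # -> 2.4

# Under the 80/20 assumption only ~20% of edge traffic heads upstream:
remote_demand = 48 * 1 * 0.2   # 9.6G of remote demand vs 20G of uplink capacity
```

A 2.4:1 ratio looks dangerous on paper, but with the 80/20 assumption the uplinks carry roughly 9.6G of demand against 20G of capacity, which is why the design held up for so long.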

Network automation was still in its infancy; you would use a floppy disk to update the firmware and operating system. Upgrading a Cisco router meant getting your terminal configured with the appropriate Xmodem/Zmodem settings and waiting hours while your data was serialized down a modem from the Cisco CCO BBS site.

Soon we were leveraging scripting languages like Expect and Perl to handle the complexity of managing network state across all the configuration files. Once you could use the SNMP private MIB to read and write a device configuration, you could make global changes in an instant and repopulate configurations across the world. In some ways this was all a step backwards from the advanced telecommunications control systems of the day; it was still a very closed and proprietary world, leaving customers no choice but to adopt complex and monolithic management applications.

So it’s 2012 and we are not much better at dealing with all of the challenges in running a system as complex as the network. The IETF finally got its act together and delivered a more robust management framework through an application protocol called NETCONF and a data modeling language called YANG. Finally you can divorce the information model from the data transfer protocol and allow for a cleaner representation of the network configuration. But is this as far as we need to go? Why is SDN so interesting, and what is it telling us about the still very complex problems of building and operating networks?
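As a sketch of what that separation looks like on the wire, here is a fragment that builds a NETCONF <edit-config> RPC using the standard base namespace. The interface payload is a hypothetical YANG-modeled config, and a real client would frame and send this over SSH rather than just serializing it:

```python
import xml.etree.ElementTree as ET

# NETCONF base namespace (RFC-defined); the config payload below is a
# hypothetical YANG-modeled fragment, purely for illustration.
NC = "urn:ietf:params:xml:ns:netconf:base:1.0"

rpc = ET.Element(f"{{{NC}}}rpc", {"message-id": "101"})
edit = ET.SubElement(rpc, f"{{{NC}}}edit-config")
target = ET.SubElement(edit, f"{{{NC}}}target")
ET.SubElement(target, f"{{{NC}}}candidate")     # edit the candidate datastore
config = ET.SubElement(edit, f"{{{NC}}}config")

# Hypothetical payload: set a description on an interface.
iface = ET.SubElement(config, "interface")
ET.SubElement(iface, "name").text = "eth0"
ET.SubElement(iface, "description").text = "uplink to core"

wire = ET.tostring(rpc, encoding="unicode")
```

The point of the exercise: the `<config>` subtree is pure information model (YANG territory), while the `<rpc>`/`<edit-config>` wrapper is pure transfer protocol, and neither needs to know about the other.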

As the title of this post suggests, I think something can be said for the expertise required to manage complex systems. The question becomes: are you going to stay a mechanic, worrying about low-level details, or are you going to be the pilot? Is it valuable to your employer for you to understand the low-level semantics of a specific implementation, or to rise above by creating proper interfaces to manipulate the state of the network through a reusable interface?

With information becoming more valuable than most commodities, it will take a shift in mindset to move from low-level addressing concerns to traffic analysis, modeling and control. Understanding where the most important data is, how to connect to it and how to avoid interference will become much more important than understanding protocols.

So how does SDN contribute to this and how do we get from the complex set of tasks of setting up and operating networks to more of a fly-by-wire approach? How do we go from managing a huge set of dials and instruments to managing resources like a symphony?

The first thing to recognize is that you can’t solve this problem in the network by itself! For years application developers’ expectations of the network were of infinite capacity and zero latency. They perceived that the flow-control capability in the network would suffice, giving them ample room to pummel the network with data. Locality was barely even an afterthought, because they were developing on local machines, unaware of the impact of crossing network boundaries. Networking guys use terms like latency, jitter, bandwidth, oversubscription, congestion, broadcast storms and flooding, while application developers talk in terms of consistency, user experience, accuracy and availability.

The second thing to recognize is that the network might need to be stripped down and built back up from scratch in order to further deal with its scaling challenges. In my eyes this is the clearest benefit of SDN, as it highlights some of the major challenges in building and running networks. Experimenting with a complex system is disastrous; in order to break new ground it must be decomposed into its simplest form, but certainly no simpler, as Einstein would say. It’s possible that OpenFlow has gone this route and must be redesigned into a workable set of primitive functions which can be leveraged not just through a centralized controller model but also to adapt new operating systems and protocols to leverage the hardware.

There is much debate over what the “best” model is here and what the objectives are. Since most networking is basically a “craft” and not a science, there are those who strive to maintain the existing methodologies and mechanisms and simply open up a generalized interface to improve control. Others might see this as a mistake: if you reproduce the current broken layering model, you are bound to run into a new set of challenges down the line, which may require another patch, protocol or fix to solve.

Maybe it would be valuable to look back at the fundamentals of networking, at what has been learned through the course of history and how other protocols behave, and to take a reflective look at our industry. How do you deal properly with connection management, data transfer efficiency and flow control? How do you leverage proper encapsulations and hierarchy to scale efficiently? What should management look like, and how do you separate mechanism from policy and deliver hop-by-hop QOS?


In some regards the move towards Software Defined Networking is an outcry of frustration in managing an ever more complex set of interrelated components. Data centers have become huge information factories; servers themselves have become clusters of computers, and our data-hungry applications require massive amounts of parallel computing, driving even more demand into the network. We could continue to take an ill-suited, feature-driven approach to networking, or we could take the opportunity to recognize the architectural principles of networking which would turn it from a craft into a science (notwithstanding the argument on true science).

Networking Guys Just Don’t Understand Software

Before I begin my rant, let me just say my first router was a Wellfleet CN with a VME bus and my first Cisco router was an AGS+. I have been around long enough to see DECnet, IPX, IP, SNA, Vines and a few others running across my enterprise network while troubleshooting 8228 MAUs beaconing in the wiring closets and watching NetBEUI “Name in Conflict” packets take down a 10,000-node RSRB network.

It’s 2012, and gone are the days when network engineers need to juggle multiple protocol behaviors such as IPX GetNearestServer and IP PMTU bugs in Windows NT 3.5, or try to find enough room in your UMB to fit all your network ODI drivers without crashing Windows.

It’s a new age, and as we look back at almost 40 years since the inception of the Internet and almost 30 years since TCP/IP was deployed across it, we are at the inflection point of unprecedented change fueled by the need to share data anytime, anywhere, on any device.

The motivation for writing this entry comes from some very interesting points of view from my distinguished ex-colleague Brad Hedlund, entitled “Dodging Open Protocols with open software”. In his post he tries to dissect both the intentions and impact of a new breed of networking players, such as Nicira, on the world of standardized protocols.

The point here isn’t to blow a standards dodger whistle, but rather to observe that, perhaps, a significant shift is underway when it comes to the relevance and role of “protocols” in building next generation virtual data center networks.  Yes, we will always need protocols to define the underlying link level and data path properties of the physical network — and those haven’t changed much and are pretty well understood today.

The “shift in relevance and role of protocols” is attributed not necessarily to what we know as the IETF/IEEE-based networking stack and all the wonderful protocols which make up our communications framework, but to a new breed of protocols necessary to support SDN.

Sidebar: Let’s just go back a second and clarify the definition of SDN. Some define Software Defined Networking in terms of control plane/data plane separation, which clearly has been influenced by the work on OpenFlow.

So the shift we see in networking towards more programmability, and the fact that we need new ways to invoke actions and carry state, is at the crux of it.

However, with the possibility of open source software facilitating the data path not only in hypervisor virtual switches, but many other network devices, what then will be the role of the “protocol”? And what role will a standards body have in such case when the pace of software development far exceeds that of protocol standardization.”

OK, so this is the heart of it: “what then will be the role of the ‘protocol’? And what role will a standards body have in such a case when the pace of software development far exceeds that of protocol standardization?”

I think the problem here is not necessarily the semantics of the word “protocol” (for this is just a contract which two parties agree upon), but the fact that there is a loosely defined role for how this “contract” will be standardized to promote an open networking ecosystem.

Generally standardization only comes when there is sufficiently understood and tested software providing a specific implementation of that standard. It’s very hard to get a protocol specification completely right without testing it in some way.

Sidebar: If you actually go back in history, you will find that TCP/IP was not the standard. The INWG was the governing standards body of the day; the international standard was supposed to be INWG 96, but because the team at Berkeley got TCP into BSD Unix, well, now it’s history. I wrote a bit about it here:

With that in mind, take a closer look at the Open vSwitch documentation, dig deep, and what you’ll find is that there are other means of controlling the configuration of the Open vSwitch, other than the OpenFlow protocol.

When it comes to OVS, it’s very important not to confuse interface and implementation. Since OVS in its classical form is just a switch, you operate it through helper routines that manipulate the state management layer in an internal datastore called OVSDB and interact with the OS. This is no different than, say, the CLI on a Cisco router. Most of the manipulation in the management plane will probably be exposed through JSON-RPC (guessing here) behind a high-level REST interface.
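As an illustration of that management plane, an OVSDB-style JSON-RPC request to list the bridges might look like the following. This follows the published OVSDB wire format as I understand it; transport and framing are omitted, and the response handling is left out:

```python
import json

# Illustrative OVSDB-style JSON-RPC request: a "transact" call selecting
# every row from the Bridge table of the Open_vSwitch database.
request = {
    "method": "transact",
    "params": [
        "Open_vSwitch",                                   # database name
        {"op": "select", "table": "Bridge", "where": []}, # empty where = all rows
    ],
    "id": 0,
}
wire = json.dumps(request)
print(wire)
```

Note this is configuration/state management, not flow programming: the same switch is driven by OpenFlow for forwarding decisions and by this kind of RPC for its durable configuration, which is exactly the interface/implementation split the paragraph above describes.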

What you must understand about OVS as it relates to control plane/data plane separation, or “flow-based network control”, is that you are essentially changing the behavior from a standardized switch based on local state to a distributed forwarding engine coordinated with global state.

From OVS:

The Open vSwitch kernel module allows flexible userspace control over flow-level packet processing on selected network devices. It can be used to implement a plain Ethernet switch, network device bonding, VLAN processing, network access control, flow-based network control, and so on.

Clearly, since we are in the realm of control plane/data plane separation, we need a protocol (i.e. a contract) which is agreed upon when communicating intent. This is where OpenFlow comes in.

Unfortunately, OpenFlow is still a very nascent technology and is continuing to evolve, but Nicira wants to solve a problem. They want to abstract the physical network address structure in the same way that we abstract the memory address space with VMMs (see “Networking doesn’t need VMWARE but it does need better abstractions”). In order to do this they needed to jump ahead of the standards bodies (in this case the ONF) and adopt some workable solutions.

For instance, OVS is not 100% compliant with OpenFlow 1.0 but has contributed to better models which will appear soon in the 1.2 specification. OVS uses an augmented PACKET_IN format and matching rules:

/* NXT_PACKET_IN (analogous to OFPT_PACKET_IN).
 * The NXT_PACKET_IN format is intended to model the OpenFlow-1.2 PACKET_IN
 * with some minor tweaks. Most notably, NXT_PACKET_IN includes the cookie of
 * the rule which triggered the NXT_PACKET_IN message, and the match fields are
 * in NXM format. */


Open source networking is nothing new: you have XORP, Zebra, Quagga, Vyatta and the standard bridging services built into Linux.

Just like with TCP/IP, if there is value in OpenFlow or its derivatives, we will see some form of standardization. OVS is licensed under Apache 2, so if you want to fork it, go ahead; that’s the beauty of software. In the meantime I wouldn’t worry so much about these control protocols; they will no doubt change over time, and good software developers will encapsulate the implementations and publish easy-to-use interfaces.

What I think people should be asking about is not so much the protocols (they all suck in their own way, because distributed computing is really, really hard) but what we can do, once we have exposed the data plane in all its bits, to solve some very nasty and complex challenges with the Internet.

Networking doesn’t need VMWARE but it does need better abstractions

Lately there has been a lot of talk around the network and the corresponding conflation of terms and hyperbole around “Network Virtualization”, including Nypervisor, Software Defined Networking, Network Abstraction Layer, SDN, OpenFlow, etc.

Recently a blog entry entitled “Networking Needs a VMWare (Part 1: Address Virtualization)” appeared on Martin Casado’s blog which tries to make a case for comparing the memory virtualization capability in today’s modern hypervisors to network virtualization.

That post left an uneasy feeling about why we are seeing this activity in the network domain, specifically with respect to the broken address architecture. This post tries to bring some clarity and to dig deeper into the root causes of the problems in networking which have led us to this point.

The synopsis in the blog goes like this:

One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.

The author of the blog post is describing a well-known aphorism by David Wheeler: “All problems in computer science can be solved by another level of indirection”. This statement is at the heart of “virtualization”, as well as of communications layering, computer architecture and programming models.

Sidebar: OSI Model

Lots of networking professionals like to refer to the 7-layer OSI model when talking about network abstractions. The problem is the OSI model was never adopted; in addition, most OSI engineers agree that the top three layers of the OSI model (Application, Presentation and Session) belong in “one” application layer. We utilize a derivative of that model, essentially the four layers represented in the TCP/IP model.

Let’s first try to define what an address is and then what is meant by encapsulation, being careful not to conflate these two important yet independent terms.

Addressing and Naming

The first thing to recognize is that the Internet is comprised of two name spaces, what we call the Domain Name System and the Internet Address Space. These turn out to be just synonyms for each other in the context of addressing with different scope. Generally we can describe an address space as consisting of a name space with a set of identifiers within a given scope.

An address space in a modern computer system is location-dependent but hardware-independent, thanks to the virtual memory manager and “memory virtualization”. The objective, of course, is to present a logical address space which is larger than the physical memory space in order to give each process the illusion that it owns the entire physical address space. This is a very important indirection mechanism; if we didn’t have it, applications would have to share a much smaller set of available memory. Does anyone remember DOS?
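The page-table indirection described above can be sketched in a few lines (the page size is the common 4 KiB; the mappings are illustrative):

```python
# Miniature model of virtual memory indirection: each process owns a page
# table mapping virtual page numbers to physical frame numbers, so two
# processes can use the same virtual address without conflict.

PAGE = 4096  # common 4 KiB page size

class Process:
    def __init__(self):
        self.page_table = {}          # virtual page number -> physical frame

    def map_page(self, vpn, pfn):
        self.page_table[vpn] = pfn

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        return self.page_table[vpn] * PAGE + offset

p1, p2 = Process(), Process()
p1.map_page(0, 7)                     # p1's virtual page 0 -> physical frame 7
p2.map_page(0, 3)                     # p2's virtual page 0 -> physical frame 3
print(p1.translate(42))               # -> 28714 (7*4096 + 42)
print(p2.translate(42))               # -> 12330 (3*4096 + 42)
```

Both processes dereference virtual address 42, yet they touch different physical memory; that per-process remapping is exactly the indirection the network address architecture lacks.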

“Another problem with TCP/IP is that the real name of an application is not the text form that humans type; it’s an IP address and its well-known port number. As if an application name were a macro for a jump point through a well-known low memory address. – Professor John Day”

Binding a service which needs to be re-locatable to a location-dependent address is why we have such problems with mobility today (in fact we may even conclude that we are missing a layer). Given the size and failure rates of today’s modern data centers, this problem also impacts the reliability of the services and applications consumers are so dependent on in today’s web-scale companies.

So while this is a very important part of OS design, it’s completely different from how the Internet works, because the address system we use today has no such indirection without breaking the architecture (i.e. NATs, load balancers, etc.).

If this is true, is the IP address system currently used on the Internet “location-dependent”? Well, actually, IP addresses were distributed as a “location-independent” name, not an address. There are current attempts to correct this, such as LISP and HIP, as well as “beyond IP” solutions such as RINA.

So it turns out the root of the problem in relation to addressing is that we don’t have the right level of indirection, because according to Saltzer and Day we need a “location-independent” name to identify the application or service, but all we have is a location-dependent address, which is just a symbolic name!

What is encapsulation?

Object-oriented programming refers to encapsulation as a pattern by which “the object’s data is contained and hidden in the object and access to it restricted to members of that class”. In networking we use encapsulation to define the different layers of the protocol stack which, as we know, “hide” the data from members not in the layer; in this way the protocol model forms the “hour-glass” shape, minimizing the interface and encapsulating the implementation.

Sidebar: Leaky Abstractions

Of course this isn’t completely true, as the current protocol model of TCP/IP is subject to “leaky abstractions”. For instance, there is no reason for the TCP logic to dive into the IP frame to read the TOS data structure; doing so would be a “layer violation”, but we know that TCP reaches into IP to compute the pseudo-header checksum. This rule can be dismissed if we think of TCP/IP as actually one layer, as it was before 1978. But the reality of the broken address architecture leads to the “middle boxes” which must violate the layers in order to rewrite the appropriate structures and stitch the connection back together.

So how does encapsulation help?

In networking we use encapsulations all the time.

We essentially encapsulate the data structures which need to be isolated (the invariants) with some other tag, header, etc. in order to hide the implementation. So in 802.1Q we use the C-TAG to denote a broadcast domain or VLAN; in VXLAN we encapsulate the host frame within a completely new IP shell in order to “bridge” it across without leaking the protocol primitives necessary for the host stack to process within a hypervisor’s stack.
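A minimal sketch of the 802.1Q case, assuming a bare frame without FCS (real frames carry full headers and checksums): the 4-byte tag, TPID 0x8100 plus the TCI carrying the VLAN ID, is inserted between the source MAC and the EtherType, hiding the VLAN membership from everything that doesn’t parse the tag:

```python
import struct

def add_ctag(frame: bytes, vlan_id: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q C-TAG after the destination+source MACs (12 bytes)."""
    tci = (pcp << 13) | (vlan_id & 0x0FFF)      # priority bits + 12-bit VLAN ID
    tag = struct.pack("!HH", 0x8100, tci)       # TPID 0x8100, then TCI
    return frame[:12] + tag + frame[12:]

# Toy untagged frame: 12 bytes of MAC addresses, IPv4 EtherType, fake payload.
frame = bytes(12) + struct.pack("!H", 0x0800) + b"payload"
tagged = add_ctag(frame, vlan_id=100)
```

The payload and addresses are untouched; only the shim is added, which is the sense in which the invariant data structures are “hidden” rather than changed.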

From the blog: “encapsulation provides the closest analog to the hierarchical memory virtualization in compute”.

So in the context of a “hierarchy”, yes, we encapsulate to hide, but not for the same reasons we have memory hierarchies (i.e. SRAM (cache) and DRAM). This generalization is where the blog post goes south.

So really, what is the root of the problem, and how is SDN an approach to solving it?

From the earlier statement, we need a “location-independent” name to identify the application or service, but all we have is a location-dependent address, which is just a symbolic name. If we go back to Saltzer, we see that’s only part of the problem, as we need a few more addresses/names and the binding services to accomplish that.

One interesting example of this is the implementation of Serval from Mike Freedman at Princeton University. Serval actually breaks the binding between the application/service name and the inter-networking address (although there are deeper problems than this, since we seem to be missing a network layer somewhere). Serval accomplishes this through the manipulation of forwarding tables via OpenFlow, although it can be adapted to use any programmable interface, if one exists. Another example is the NDN project led by Van Jacobson.
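A toy sketch of the kind of binding such designs restore (the names, addresses and resolution policy here are all hypothetical, not Serval’s actual mechanism): applications resolve a location-independent service name, and instances can come and go without breaking the name:

```python
# Hypothetical name/address split: a resolution table maps a
# location-independent service name to whatever addresses currently host it.

class ServiceTable:
    def __init__(self):
        self.bindings = {}                 # service name -> set of addresses

    def register(self, service, address):
        self.bindings.setdefault(service, set()).add(address)

    def unregister(self, service, address):
        self.bindings.get(service, set()).discard(address)

    def resolve(self, service):
        addrs = self.bindings.get(service)
        if not addrs:
            raise LookupError(f"no instance of {service}")
        return sorted(addrs)[0]            # trivial policy; real systems balance load

t = ServiceTable()
t.register("web", "10.0.0.5")
t.register("web", "10.0.1.9")
t.unregister("web", "10.0.0.5")            # instance moved; the name stays valid
print(t.resolve("web"))                    # -> 10.0.1.9
```

The application only ever holds the name “web”; mobility and failover become table updates rather than broken connections, which is the property the address architecture critique above is after.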

In summary

Yes, it is unfair to conflate “Network Virtualization” with “OS Virtualization”, as they deal with a different level of abstraction, state and purpose. Just as hypervisors were invented to “simulate” a hardware platform, there is a need to “simulate” or abstract the network in order to build higher-level services and simplify the interface (not necessarily the implementation). In fact, a case can be made that “OS Virtualization” may eventually diminish in importance as we find better mechanisms for dealing with isolation and protection of the host stack, while network virtualization will extend beyond existing solutions and even existing protocols, allowing us to take on a new set of challenges. This is what makes SDN so important: not the implementation but the interface. Once we have this interface, which is protocol-independent, we can start to look at fixing the really hard problems in networking in a large-scale way.

NodeFlow: An OpenFlow Controller Node Style

Unless you’ve been under a rock lately, you have probably heard something about Software Defined Networks, OpenFlow, Network Virtualization and Control Plane/Data Plane separation.

Some of the reasons for the interest might be:

  • Evolution of the system architecture as a whole (Network, NIC, PCIe, QPI, CPU, Memory), along with x86_64 instructions, OS, drivers, software and applications, has allowed many services to run on a single host, including network services. Extending the network domain into the host allows for customizable tagging, classification, load balancing and routing, with the utopia being ubiquitous control of logical and physical resources by a combination of in-protocol state, forwarding tables and a distributed control system.
  • Non-experimental network pathologies, which are causing havoc with large-scale systems. It turns out there are some very “real” problems which were never part of the Ethernet and TCP/IP design space, and software allows us to experiment with different ideas on how to solve them.
  • Leveraging a possibly untapped design space in order to differentiate, leapfrog competition or disrupt the marketplace.

So what is OpenFlow? Well according to the Open Networking Foundation:

“OpenFlow is an open standard that enables researchers to run experimental protocols in the campus networks we use every day”

This paradigm shift into the guts of the network might be better explained by a surgical assessment of the network core, its protocol structure and the devices which deal with enrollment, classification, multiplexing/demultiplexing, flow control and routing, but that is a post for another day.

In the meantime the “network” has evolved into a first-class citizen amongst infrastructure architects, software developers and consumers alike. No, I am not talking about the Social Network by big boy Zuck, but the fact that networks are finding themselves ingrained in almost anything not nailed down. This so-called “Internet of Things” tells us that soon the network will be stitched into our lives through the air and into our clothes.

There are many arguments about the value of OpenFlow and SDN, but to find the benefits and use-cases the network domain experts may find the current toolsets and platforms a bit impenetrable. The current controller implementations are written in a combination of C, Python and Java, and because of the “asynchronous” nature of the OF protocol, additional libraries have to be leveraged, including Twisted and NIO, which make it more difficult to understand exactly what is going on.

To that end I introduce NodeFlow, an OpenFlow controller written in pure JavaScript for Node.JS. Node.JS provides an asynchronous library over JavaScript for server-side programming, which is perfect for writing network-based applications (ones that don’t require an excessive amount of CPU).

NodeFlow is actually a very simple program and relies heavily on a protocol interpreter called OFLIB-NODE written by Zoltan LaJos Kis. I have a forked version of this library (see below) which has been tested with OpenFlow version 1.0.

Sidebar: A note on OpenFlow

Even though the Open Networking Foundation has ratified the 1.2 protocol specification, we have yet to see a reference design which allows developers to experiment. In order to get a grasp of the programming model and data structures, I have concentrated on the most common implementation, OpenFlow 1.0 in OpenVSwitch.

Sidebar: Why Node.JS

Node.JS has become one of the most watched repos in GitHub and is headed up by the brilliant guys at Joyent. Anyone interested should check out Bryan Cantrill’s presentation  Building a Real-Time Cloud Analytics Service with Node.js

Setting up the development environment

Leveraging OpenVSwitch and tools such as Mininet, anyone can create a simulated network environment on their own local machine. Instructions on how to set up the development environment can be found here: Download and Get Started with Mininet.

Code review

We first set up the network server with a simple call to net.createServer, providing the port and address to listen on. The address and port are configured through a separate start script.

NodeFlowServer.prototype.start = function(address, port) {
    var self = this

    var socket = []
    var server = net.createServer()

    server.listen(port, address, function(err, result) {
        util.log("NodeFlow Controller listening on " + address + ':' + port)
        self.emit('started', { "Config": server.address() })
    })

The next step provides the event listeners for socket maintenance, creates a unique sessionID from which we can keep track of each switch connection, and sets up our main event process loop, which is called every time we receive data on our socket channel. We use a stream library to buffer the data and return the OpenFlow decoded message in the msgs object. We make a simple check on the message structure and then pass it on for further processing.

server.on('connection', function(socket) {
    socket.setNoDelay(noDelay = true)
    var sessionID = socket.remoteAddress + ":" + socket.remotePort
    sessions[sessionID] = new sessionKeeper(socket)
    util.log("Connection from : " + sessionID)

    socket.on('data', function(data) {
        var msgs = switchStream.process(data);
        msgs.forEach(function(msg) {
            if (msg.hasOwnProperty('message')) {
                self._processMessage(msg, sessionID)
            } else {
                util.log('Error: Message is unparseable')
            }
        })
    })
})

In the last section we leverage Node.JS EventEmitters to trigger our logic using anonymous callbacks. These event handlers wait for a specific event to happen and then trigger processing. We handle two specific events just for this initial release: ‘OFPT_PACKET_IN’, which is the main event to listen on for PACKET_IN messages, and ‘SENDPACKET’, which simply encodes and sends our OF message on the wire.

self.on('OFPT_PACKET_IN', function(obj) {
    var packet = decode.decodeethernet(, 0)
    nfutils.do_l2_learning(obj, packet)
    self._forward_l2_packet(obj, packet)
})

self.on('SENDPACKET', function(obj) {
    nfutils.sendPacket(obj.type, obj.packet.outmessage, obj.packet.sessionID)
})

The “Hello World” of OpenFlow controllers is simply a learning bridge. Below is the implementation, which is fundamentally a port of NOX’s Python pyswitch.

do_l2_learning: function(obj, packet) {
    self = this

    var dl_src = packet.shost
    var dl_dst = packet.dhost
    var in_port = obj.message.body.in_port
    var dpid = obj.dpid

    if (dl_src == 'ff:ff:ff:ff:ff:ff') {
        return // do not learn the broadcast address
    }
    if (!l2table.hasOwnProperty(dpid)) {
        l2table[dpid] = new Object() // create a table per switch
    }
    if (l2table[dpid].hasOwnProperty(dl_src)) {
        var dst = l2table[dpid][dl_src]
        if (dst != in_port) {
            util.log("MAC has moved from " + dst + " to " + in_port)
            l2table[dpid][dl_src] = in_port
        }
    } else {
        util.log("learned mac " + dl_src + " port : " + in_port)
        l2table[dpid][dl_src] = in_port
    }
    if (debug) {
        util.log(JSON.stringify(l2table))
    }
}


Alright, so seriously, why the big deal? There are other implementations which do the same thing, so why is NodeFlow so interesting? Well, if we look at setting up a Flow Modification, which is what gets instantiated in the switch forwarding table, we can see every element in JSON notation thanks to the OFLIB-NODE library. This is very important, as deciphering the TLV-based protocol from a normative reference can be dizzying at best.

setFlowModPacket: function(obj, packet, in_port, out_port) {

    var dl_dst = packet.dhost
    var dl_src = packet.shost
    var flow = self.extractFlow(packet)

    flow.in_port = in_port

    return {
        message: {
            version: 0x01,
            header: {
                type: 'OFPT_FLOW_MOD',
                xid: obj.message.header.xid
            },
            body: {
                command: 'OFPFC_ADD',
                hard_timeout: 0,
                idle_timeout: 100,
                priority: 0x8000,
                buffer_id: obj.message.body.buffer_id,
                out_port: 'OFPP_NONE',
                flags: ['OFPFF_SEND_FLOW_REM'],
                match: {
                    header: {
                        type: 'OFPMT_STANDARD'
                    },
                    body: {
                        'wildcards': 0,
                        'in_port': flow.in_port,
                        'dl_src': flow.dl_src,
                        'dl_dst': flow.dl_dst,
                        'dl_vlan': flow.dl_vlan,
                        'dl_vlan_pcp': flow.dl_vlan_pcp,
                        'dl_type': flow.dl_type,
                        'nw_proto': flow.nw_proto,
                        'nw_src': flow.nw_src,
                        'nw_dst': flow.nw_dst,
                        'tp_src': flow.tp_src,
                        'tp_dst': flow.tp_dst
                    }
                },
                actions: {
                    header: {
                        type: 'OFPAT_OUTPUT'
                    },
                    body: {
                        port: out_port
                    }
                }
            }
        }
    }
}


Performance and Benchmarking

So I used Cbench to compare NOX vs. NodeFlow and here are the results.

The following configurations were benchmarked (the Cbench response-rate charts are not reproduced here):

  • NOX Python [./nox_core -i ptcp: pytutorial]
  • NOX C++ [./nox_core -i ptcp: switch]
  • NodeFlow [running with Debug: False]
  • C-based controller

As you can see from the numbers, NodeFlow can handle almost 2X what NOX can do and is much more deterministic. Maxing out at 4600 rsp/sec is not shabby on a VirtualBox VM on my Mac Air!


At just under 500 LOC, this prototype implementation of an OF controller is orders of magnitude smaller than comparable systems. Leveraging JavaScript and the high-performance V8 engine allows network architects to experiment with various SDN features without dealing with all of the boilerplate code required for event-driven programming. I hope someone gets inspired by this and takes a closer look at Node.JS for network programming.

So how do I get NodeFlow?

NodeFlow is an experimental system available at GitHub: git:// along with my fork of the OFLIB-NODE libraries here: git:// If you would like to contribute or have any questions, please contact me via Twitter @gbatcisco.

Special thanks to Zoltan LaJos Kis for his great OFLIB-NODE library, without which this work couldn’t have been done, and to Matthew Ranney for his network decoder library node-pcap.

Cisco UCS “Cloud In A Box”: Terabyte Processing In RealTime

Now I hate using the term “Cloud” for anything these days, but in the latest blog entry from Shay Hassidim, Deputy CTO of Gigaspaces, Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2, the Cisco UCS 260 took the place of 16 Amazon High-Memory Quadruple Extra Large Instances. With 16:1 scaling, imagine what you can do with a rack of these; in other words, forget about Hadoop, let’s go real-time data-grid-enabled networking!

With 40 Intel cores and 1TB of memory available to the Gigaspaces XAP high-performance In-Memory Data Grid, the system achieved an astounding 500,000 ops/sec on 1024B POJOs and could load 1 billion objects in just under 36 minutes.

Now this might not sound extraordinary, but when you consider what it takes to build an application where a 40-core, 1TB system is actually CPU- and memory-bound, to properly deal with failures, and to have automation and instrumentation, you can’t beat this kind of system. Gigaspaces is also integrated with the Cisco UCS XML API for dynamic scaling of hardware resources.

Eventually people will catch on that memory is critical for dealing with “Big Data” and that it’s no longer an issue of reliability or cost. Without disk rotational latency and poor random access in the way, we can push the limits of our compute assets while leveraging the network for scale. Eventually we might see a fusion of in-memory data grids with the network in a way that allows us to deal with permutation traffic patterns by changing the dynamics of networking, processing and storage.

A View from the Developer Nation


By: Gary Berger

Every year the folks at Trifork and InfoQ put on a conference in San Francisco (expanding to other geographies) for enterprise software developers. The conference is geared towards team leads, architects, product managers and developers of all skill sets. This conference is different than most because the constituents are all practitioners (i.e. software developers), not academics, and the agenda spans from Architecture to Agile and Cloud. Many of the leading players in the industry attend, including Twitter, NetFlix, Amazon and Facebook. The discussions are usually very intriguing and provide a glimpse into the complex ecosystems of these successful companies.

After being involved in this community for several years I can say that the level of maturity continues to accelerate as developers become experienced with new toolkits and new languages, and are encouraged to fail fast through techniques like Lean Development, Kanban and Agile. Open Source continues to cross-pollinate software developers, allowing them to iterate through solutions, find what works and keep producing minimal viable products on schedule. The fast innovation in this industry is the culmination of many geniuses over time, paying it forward with their hard work and dedication so others can benefit, and you can see the influences starting to emerge across architectures.

This paper does not intend to be a comprehensive text on software development practices, tools and techniques. I will, however, try to distill common themes and highlight specifics which I find interesting and where others might want to dig deeper on their own.

JAVA as a Platform

It’s been several years now that I have been advocating the use of JAVA as the abstraction layer for modern applications. I have been exposed to JAVA for many years, including designing an infrastructure for one of the world’s first equity matching kernels built in Java. At the time we leveraged Solid Data (DRAM-based) storage and one of the first object databases to store our transactions. We leveraged CORBA for inter-process messaging and the Netscape HTTP server for our front-end. It was a complicated set of interacting components that needed to work together flawlessly to be successful. The system itself was completely over-engineered, a standard design philosophy in those days, and performed perfectly for several years before the business could not be sustained any longer. Turned out the “Mom and Pop” investor wasn’t interested in trading after-hours and there just wasn’t enough liquidity to sustain a marketplace.

Needless to say, JAVA is everywhere and has recently become the poster child for the nascent Platform as a Service model. Even companies like Yammer are falling back to JAVA from Scala because of inconsistent build tools and developer education. To be fair, it’s also the complexity that comes from PolyGlot programming, so take it or leave it. A great view on the practicality of JAVA from Erik Onnen here: Don’t forget to check out Oracle’s Public Cloud offering here:

Why JAVA? Why Now?

About a year ago now, Rod Johnson proclaimed, “Spring is the new cloud operating system”.  Even though the term “cloud” has been overloaded to the extent it’s almost superfluous, it was a profound statement when centered on the JVM and the JAVA language. For more information see Alex Buckley’s presentation on JAVA futures here:

Sidebar: Rod’s Key Note

Rod Johnson is the General Manager of SpringSource, a company acquired by VMWare to expand their footprint into the core of application development. Rod usually speaks about the evolution of the Spring Framework, his gadgets such as Spring Roo, and Aspect Oriented Programming, but this time he was somewhat reflective on his career and his life. His industry-acclaimed book on Java development practices turned into a massively successful Open Source project called the Spring Framework, from which the first real threat to the JEE ecosystem was felt. Spring and its philosophy around dependency injection are mainstream, and his teams have a catalog of products under the SpringSource banner. Rod has had an astonishing career and talks enthusiastically about the potential for entrepreneurs to change the world through software. See Rod’s presentation here:

Oracle acquired Sun Microsystems in early 2009 and the Java community was rocked. Hints of legal battles with Google over Java left people to assess the impact on the Java ecosystem. Even the survival of the OpenJDK project, the open-source incarnation of the JAVA platform, was put in question.

Luckily, Oracle has caught on that JAVA is the way forward and is now marketing the JEE as a “Platform as a Service”. Clearly it is the breadth and depth of the JAVA platform which makes it successful: with a large set of toolkits, frameworks and libraries and a highly educated developer community, the JAVA platform is everywhere. From Google to NetFlix and Twitter, JAVA can be found running some of the world’s most successful systems.

JAVA is now Oracle’s future. From Fusion to Exalogic and the JEE platform itself taking on a full service model, JAVA continues to have a profound impact on the industry.

Sidebar: The “Application Memory Wall” and its impact on JAVA

When trying to understand the scalability limits of a system we sometimes have to look under the covers at the intent of the application and how the design leverages the core resources of a system. Gil Tene from Azul Systems boils down the problem this way: we all know that the memory wall exists and is determined by memory bandwidth and latency, which evolve at a much slower pace than CPUs. However, we have yet to breach this wall thanks to the phenomenal engineering of modern chips such as Intel Nehalem; instead we have been constrained by how applications are designed. This actually creates several issues, as infrastructure designers may “architect” the right solution to the wrong problem.

In the Java world the most restrictive function is garbage collection: the process of removing dead objects, recouping memory and compacting the fragmented heap. This process impacts all major applications, and teams spend man-years tweaking the available parameters on their collectors to push the problem as far into the future as they can, but alas it is unavoidable: the dreaded stop-the-world pause. Developers must be very conscious of their code and how they organize their object models in order to limit the overhead which every object has on the heap. (See Attila Szegedi of Twitter’s presentation.)

Azul Systems has some of the brightest minds in systems design, memory management and garbage collection, and has built a business to solve this specific problem. Currently the most widely used garbage collector (GC) is the Concurrent Mark Sweep (CMS) collector in the HotSpot JVM. CMS is classified as a monolithic stop-the-world collector, which must take a full lock to catch up with mutations (i.e. changes started during GC) and perform compaction.

The C4 collector from Zing has recently been upgraded to version 5, which no longer requires a hypervisor for its page-flipping techniques and instead has boiled down the necessary Linux kernel changes to a kernel loadable module (trying to be pushed into upstream kernel through Managed Runtime Initiative). This frees up wasted memory and removes the overhead of a hypervisor while still providing pause-less GC for applications. Zing has many more benefits, which can be investigated offline. See Gil’s presentation here:

The JAVA ecosystem is thriving; both the runtime and the language are becoming better at dealing with concurrency, expressiveness and performance. Despite the fact that the current incarnation of objects has not lived up to its promise (see below), developers seem to always find a way around that, continuing the evolution of this platform.


For several years now the developer community has been entrenched in debate over Object Oriented Programming (OOP) vs. Functional Programming, but more specifically over the implementation of OOP methodology in Object Oriented Languages (OOL) such as JAVA. In part, code reuse, componentization and frameworks have not lived up to their promises. Readability has for the most part been sacrificed, as encapsulation and inheritance have been somewhat abused thanks to the creativity of developers and high-pressure timelines. QCON’s “Objects on Trial” was an interesting debate over the use of objects; sorry to say, the jury voted “guilty”. Regardless of the verdict, poly-paradigm computing (i.e. a mix of object-oriented, procedural, functional, etc.) will play a big role in application development, for the same reason that different languages are better for different purposes. This means app developers need to work transparently across these styles, picking up strengths such as reusable software patterns while minimizing weaknesses such as poor concurrency support.
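As a small illustration of what poly-paradigm code looks like in practice, here is a hedged JavaScript sketch (all names hypothetical) mixing an object-oriented constructor with a functional-style transformation over plain data:

```javascript
// Object-oriented part: state plus behavior on a prototype.
function Portfolio(name) {
  this.name = name
  this.positions = []
}
Portfolio.prototype.add = function(symbol, qty) {
  this.positions.push({ symbol: symbol, qty: qty })
  return this // fluent style, enables chaining
}

// Functional part: no mutation, just a pipeline over values.
function totalShares(positions) {
  return positions
    .map(function(p) { return p.qty })
    .reduce(function(a, b) { return a + b }, 0)
}

var p = new Portfolio('demo').add('CSCO', 100).add('VMW', 50)
console.log(totalShares(p.positions)) // 150
```

Neither style is "right"; the object keeps related state together, while the pure function stays trivial to test and reuse.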

Sidebar: Ruby, Groovy, Scala, Erlang, Javascript, DART

Anyone who says language doesn’t matter when designing a system either doesn’t have a system that cares about flexibility, performance and agility, or simply doesn’t have a large enough problem to care. All things being equal, developers will gravitate to what they are most familiar with, regardless of readability, agility or the complexity it adds. Language development continues to evolve: from simple annotations replacing lines of boilerplate code, to aspect-oriented programming, to syntactic sugar which makes code much more readable. Languages are the face of the development process; the more expressive and easier to use, the faster applications can be created and the easier they are to maintain and extend.

Over the past several years new languages have taken hold, such as Ruby, Groovy and Scala; older languages have been given a second life with the growing adoption of Erlang and JavaScript; and experimental languages have popped up to challenge the others, such as Go and DART.

The need for “finding the right tool for the job” gives rise to the idea of PolyGlot programming, where many different languages are utilized based on their suitability for purpose. Mostly you will find JAVA in every shop playing a critical role in the system, with a sprinkle of JavaScript for client-side coding, some Erlang for middleware, maybe Python, Groovy, C, C++, PERL, etc. Typically you have a high-performance core written in a statically typed, compiled language like C++ responsible for all of the performance-critical code, a plugin or module architecture to extend the core, and a scripting or higher-level language which provides familiar storage primitives, collections, etc.

Frameworks such as Ruby on Rails have been a blessing to those who could leverage their rich dynamic language and templating tools for building quick web-based applications, but a curse when trying to do something at a level of scale beyond the Ruby runtime. (Twitter got burned by this and moved to Scala.)

For more complex scenarios, such as running distributed gossip protocols with 99.999999% reliability, Erlang might be the right choice, at the sacrifice of more complex coding (see Steve Vinoski’s talk on WebMachine, an HTTP server written in Erlang).

There were a number of conversations throughout the conference regarding the importance of a type system. “A type system associates a type with each computed value[1]“. Basically, a type system ensures that the value provided to or returned from a function is of the expected type. This has several benefits: it allows code readers to explicitly see what variable types are being passed around (if they don’t obscure the reference); by providing a static type, the compiler can allocate the proper amount of memory; and lastly, it allows the compiler to check the validity of a variable instead of hitting a runtime error.
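A tiny sketch (hypothetical function names) of what that last benefit means in a dynamically typed language like JavaScript, where the check a compiler would do for free has to be written by hand:

```javascript
// Without static types, JavaScript silently coerces the "wrong" kind
// of value: + on a string concatenates instead of adding.
function addShipping(price) {
  return price + 5 // intended as numeric addition
}

addShipping(100)   // 105, as intended
addShipping('100') // '1005' -- string concatenation, a silent bug

// A hand-written guard stands in for the compile-time check:
function addShippingChecked(price) {
  if (typeof price !== 'number') {
    throw new TypeError('price must be a number')
  }
  return price + 5
}
```

The bug in the first version only surfaces at runtime, and possibly far from the call site; a static type system would reject the `'100'` call before the program ever ran.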

Unfortunately this isn’t a panacea; the type system may get in the way, especially when the primitive types are hidden, making it very difficult to read the code. More on this when we talk about DART.

One of the most popular languages in software development today is JavaScript. JavaScript has gotten a bad rap, and there are many things to be wary of when it comes to writing good JavaScript. Readers should consult Douglas Crockford’s books on JavaScript for more information.

This tiny little language is a pure object-oriented language (everything is an object). It provides anonymous functions, dynamic typing and functional reactive programming, and has a host of examples from which developers can learn. Oh, and it’s embedded in every browser running on almost every device you can imagine. Originally JavaScript was designed as a programmatic interface for manipulating the DOM in web browsers, but it has found its way into server-side implementations through Node.JS and Rhino, which detach the language from the incredibly complex Document Object Model.

Sidebar: JavaScript and Node.JS

The popularity of Node cannot be overstated. There is a rich set of new libraries created daily, from LDAP to Riak, and all kinds of middleware to solve interesting problems. Node is merely a shim on top of JavaScript and the V8 VM which insulates some of the bad parts of the JavaScript language and confines the developer to a single-threaded model with local scoping, simplifying the design by only allowing state to be shared across an IPC barrier, similar to Erlang. Node is exceptional at dealing with network-based applications, supporting a variety of services such as HTTP, REST and WebSockets. Node.JS has become one of the most watched projects on GitHub and is gaining in popularity due to its rich but sometimes dangerous base language, JavaScript.

As an example, here is how simple it is to create a Web server in Node.

var http = require('http');

http.createServer(function (req, rsp) {
  rsp.writeHead(200, {'Content-Type': 'text/plain'});
  rsp.end('Hello World\n');
}).listen(8080, "");
console.log('Server running at');

I will reiterate the mantra of “having the right tool for the job”, and certainly Node, V8 and JavaScript have their place, centered on I/O-demanding applications. The event-driven, single-threaded model is not for everyone, but it does provide a powerful toolkit and an expressive language useful for a number of different use-cases.

Structured Web Programming, Intro to DART

While JavaScript is gaining in popularity, primarily because of the dominance of web-based applications and the now-assured success of HTML5 and CSS3 (thanks to Adobe dumping Flash for mobile), JavaScript may not have all the answers, which is why Google is investing in DART.

Google may have realized that the divergence between native applications (i.e. ones that have to be coded separately for iOS and Android) and web-based applications was not going to be solved by JavaScript or the ECMA community, the stewards of the specification, so Google created DART.

DART is being created by a number of people within Google, most notably Gilad Bracha, who gave the keynote and a session on DART, and Lars Bak, lead developer of the Google V8 virtual machine.

Gilad has a long history in language development, including his work on StrongTalk, and he co-authored the Java Language Specification.

DART is being built as a pure object-oriented language with familiar syntax, such as classes and interfaces, and a type system that doesn’t interfere with developer productivity. In its current pre-release, DART script is compiled into JavaScript to run on the V8 virtual machine. Eventually there will be a standalone DartVM (maybe part of Chromium) which will allow DART scripts to avoid this code-generation step.

New languages take time to be nurtured and accepted into the mainstream, but what makes DART so compelling is the rampant change in client-side development and device refreshes, which might give it a great leap forward if the OSS community adopts it.

Data Management

In the field of application science (if there were such a thing), data management would be the foundational pillar of writing applications which are reliable, flexible and engaging. In the post client/server era we are bombarded with new ways of storing, retrieving, protecting and manipulating data. The 451 Group has put together a very comprehensive look at this space, which incorporates the traditional relational database stores, the emerging NoSQL stores and the second-incarnation relational stores called NewSQL. Accompanying them is also the growing and important data grid and caching tier.

Let’s look for a moment at NoSQL, since it is the hottest field in data management today.

In 1998 Carlo Strozzi used the term NoSQL to describe his new relational database system. In 2004, at the USENIX OSDI conference, Jeff Dean and Sanjay Ghemawat published their paper “MapReduce: Simplified Data Processing on Large Clusters”. It was the start of the industry’s changing viewpoint on data persistence and query processing. By 2007, database guru Michael Stonebraker had published his paper entitled “The End of an Architectural Era (It’s Time for a Complete Rewrite)”. In this paper he exposes the issues around relational database models and the SQL language, and challenges the industry to build highly engineered systems that will provide a 50X improvement over current products. He does, however, ignore the impact of commodity scale-out distributed systems. And in 2009, Eric Evans used the term “NoSQL” at a conference to describe the emerging “non-relational, distributed data stores”.

These systems have grown from bespoke, highly guarded proprietary implementations to general-use and commercially backed products such as Hadoop. While the impact of implementations such as Hadoop, Cassandra, HBase, Voldemort, etc. has yet to be fully understood, every player from EMC to Arista Networks has banked on participating in this growing field, and “Big Data” has become the new buzzword in the industry.

SQL and Not Only SQL

Many large-scale shops are using and/or experimenting with a multitude of DMEs, including traditional databases such as MySQL, PostGres and BerkeleyDB, and are starting to experiment with NoSQL implementations. MySQL, PostGres and BDB are still very much in active use and are being used in new ways, i.e. MySQL as a graph database and BDB as a key/value store.

Sidebar: Facebook and MySQL

Facebook uses MySQL as a graph database. The data model represents the graph by associating each row with either a vertex or an edge. This data model is abstracted into their API so that application developers just deal with data in those terms; there are no indexes or joins in the data layer. In order to deal with disk bottlenecks they use a block-device cache on the database servers to improve I/O, especially for random reads.
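A hedged sketch of the idea (this is not Facebook's actual schema or API, just an illustration of the shape): every row is either a vertex or an edge, and the "join" logic lives in application code rather than in SQL.

```javascript
// Flat rows, as they might sit in a relational table.
var rows = [
  { id: 1, kind: 'vertex', data: { name: 'alice' } },
  { id: 2, kind: 'vertex', data: { name: 'bob' } },
  { id: 3, kind: 'edge', src: 1, dst: 2, label: 'friend' }
]

// Graph traversal is done by the application layer, not the database:
// find outgoing edges, then resolve each destination vertex.
function neighbors(vertexId) {
  return rows
    .filter(function(r) { return r.kind === 'edge' && r.src === vertexId })
    .map(function(e) {
      return rows.filter(function(v) { return v.id === e.dst })[0]
    })
}

console.log(neighbors(1)[0].data.name) // 'bob'
```

The payoff is that the storage layer stays dead simple (rows keyed by id), while the graph semantics live entirely in the API the application developers see.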

Hadoop, Cassandra, MongoDB, Riak

Each of these systems has benefits and tradeoffs depending on the use-case. Application designers need to understand how their application behaves: how write-oriented it is, how read-oriented, how dependent on transactions, and how sensitive to bandwidth and latency. The reality is there is “no one solution to all problems”, and there is a lot of experimentation going on out there.

Hadoop has been a very successful project despite its downfalls. The issue I will highlight with Hadoop relates to the memory management and impact of garbage collection discussed earlier.

Instead I would like to focus on Cassandra. Cassandra combines the eventually consistent model of Dynamo with the data model of Google BigTable. It was created at Facebook and transferred to the Apache Foundation, where it continues to thrive with active participation and new customer adoption. At its core Cassandra is a four- to five-dimensional key-value store with eventual consistency. You can choose to model it as:

  • Keyspace − > Column Family − > Row − > Column − > Value
  • Keyspace − > Super Column Family − > Row − > Super Column − > Column − > Value
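The first modeling choice above can be visualized as nested JavaScript objects (the keyspace, column family and row key names here are illustrative, not a Cassandra API):

```javascript
// Keyspace -> Column Family -> Row key -> Column -> Value
var keyspace = {
  UserData: {            // column family
    'user:1001': {       // row key
      name: 'alice',     // column -> value
      city: 'nyc'
    },
    'user:1002': {
      name: 'bob',
      city: 'sfo'
    }
  }
}

// A read is just a walk down the hierarchy:
console.log(keyspace.UserData['user:1001'].city) // 'nyc'
```

The super-column variant simply adds one more level of nesting between the row and its columns.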

Cassandra provides a number of benefits, including its flexible schema, multi-datacenter awareness, its ability to retrieve ordered ranges and high-performance distributed writes. Several years ago sites like Twitter were discussing Cassandra; now there appears to be plenty of adoption, including a major push by NetFlix. See Sid Anand’s presentation from NetFlix here.

Sidebar: Netflix

Most people know that NetFlix discontinued building datacenters and instead chose to adopt full production support on Amazon EC2. Through the partnership, NetFlix has been instrumental in shaping Amazon services and improving their design to support high-performance multi-tenant infrastructure services. There is, however, a downside. Given the high I/O demand of some NetFlix processes (i.e. video codecs), they have to be very diligent about understanding the performance envelope of their system. Because Amazon chooses where VMs reside, there is a statistically significant chance that two high-demand users will wind up on the same physical gear. This has led some to run a burn-in process to indirectly determine whether another high-demand customer is sharing the system, which could cause instability or breach service levels. See Adrian Cockcroft’s presentation for more information on NetFlix and Amazon.

Sidebar: A note on Riak

Riak is a distributed key/value store loosely based on Amazon Dynamo. Like Cassandra, Riak uses an eventual consistency model, but one which allows the developer to make more discrete choices in the consistency/availability tradeoff (i.e. you can choose to accept writes even under partitioning events where you may not have quorum [AP], or you can enforce consistency by making quorum mandatory [CP]). Riak handles complex mechanisms such as vector clocks, Merkle trees, consistent hashing, read repair, hinted handoff and gossiping; these infrastructure services are highly complex and are hidden from the application developer behind a flexible API.
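The quorum tradeoff can be sketched with the classic Dynamo-style N/R/W arithmetic. This is a simplified illustration of the principle, not Riak's actual API: with N replicas, requiring read quorum R and write quorum W such that R + W > N forces every read quorum to overlap the latest write quorum.

```python
# Sketch of tunable quorum consistency (Dynamo-style N/R/W knobs).
# R + W > N guarantees a read quorum intersects the last write quorum,
# so reads observe the most recent write.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True when read and write quorums must intersect."""
    return r + w > n

# N=3 replicas: R=2, W=2 leans consistent (CP); R=1, W=1 leans available (AP).
print(is_strongly_consistent(3, 2, 2))  # True
print(is_strongly_consistent(3, 1, 1))  # False
```

The same arithmetic is why a single formula captures both ends of the tradeoff: lowering R and W keeps the service answering during partitions at the cost of possibly stale reads.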

Riak is written in Erlang, which provides a highly reliable foundation for building distributed services. Erlang is designed to be robust in the face of failure and leverages a number of techniques to maintain reliability. But there is a price for writing in Erlang's functional, event-driven style. It is either loved or hated, but the functional model and actor patterns it embraces have also become a reference point for how new languages approach concurrency (see Dart Isolates). Riak's presentation can be read here:

NewSQL: A Blast from the Past

So what in the world is NewSQL? NewSQL is the second incarnation of relational datastores, which may have more in common with the original use of the term NoSQL by Strozzi. Essentially these database engines provide the same DB semantics, such as indexes, foreign keys and stored procedures, and are geared towards OLTP workloads that require ACID consistency. They borrow a number of techniques found in NoSQL datastores, such as sharding and reconciliation, but are designed to be compatible with traditional DDL and SQL language constructs. NewSQL implementations such as VoltDB and Drizzle will play an increasing role as these types of datastores continue to live on. See Ryan Betts' presentation here:
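As a rough illustration of the sharding technique these engines borrow, here is a minimal hash-based shard router. It is hypothetical; real NewSQL partitioners are far more sophisticated (and use consistent hashing to avoid mass reshuffling when shard counts change).

```python
import hashlib

# Minimal hash-based sharding sketch: route a primary key to one of
# NUM_SHARDS database nodes. A stable hash means the same key always
# lands on the same shard.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shard = shard_for("order:1001")   # hypothetical primary key
assert 0 <= shard < NUM_SHARDS
print(f"order:1001 -> shard {shard}")
```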

Reliability at Scale

Complex systems can trigger bottlenecks and exhibit near-linear falloff as side effects dominate interactions. Facebook's Robert Johnson gave a great talk on how they use their fail-fast development process to reduce failures at scale. See Robert's presentation here:

Facebook has a unique position on dealing with scale: with over 800M subscribers, they process more requests in an hour than most Fortune 500 companies do in a month. For Facebook, scalability is intrinsically linked to reliability (i.e. dealing with failures). “The goal isn’t to avoid mistakes, it’s to make the costs of mistakes low”. With an aggressive approach to moving fast, they reduce the convergence time for failure, allowing them to be more consistent (sort of a high-pass filter). By making frequent small changes, they reduce the surface area, allowing them to quickly skirt around problems without having a large impact. They have honed this methodology into a craft, allowing them not only to fail fast but also to encourage developers to try new things. Their goal is to run a new experiment every day, which allows them to keep the site engaging.

Measurement and Monitoring

Critical to all well-run services is the proper use of instrumentation and monitoring. On the Internet, and more specifically in the cloud, it is even more important, as systems typically consist of hundreds if not thousands of independent services tied together to deliver a user-facing service. A theme that emerged throughout most of the application-architecture talks was to instrument the heck out of everything. printf-style logging (where the developer inserts log statements into the codebase), GC logging and application logs (e.g. webserver logs) are all necessary to trace back the state of the system at the time of a failure. In these systems, coordinated failures (failures which coincide across many services) become more prominent, and latency issues can be hidden in the inter-service calls, requiring a massive amount of information. Some operators have hundreds of machines dedicated to instrumentation, logging and monitoring.
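As a minimal sketch of the "instrument everything" theme, the following hypothetical wrapper logs the name and latency of every service call so it can be traced back after a failure. The names are assumptions for illustration, not any particular vendor's tooling.

```python
import logging
import time

# Sketch: wrap every service call so its latency is always logged,
# giving the raw material needed to trace back failures later.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("svc")

def instrumented(fn):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            log.info("call=%s latency_ms=%.2f", fn.__name__, elapsed_ms)
    return wrapper

@instrumented
def lookup_user(user_id):        # hypothetical service call
    return {"id": user_id}

lookup_user(42)
```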

Sidebar: Disaster Porn and Joyent

One of the most interesting talks was by Bryan Cantrill from Joyent. Bryan is one of the original developers of DTrace at Sun Microsystems and now runs one of the largest competitors to Amazon EC2. DTrace is an expressive debugging tool built into Solaris; it is also incorporated in Illumos, the open-source derivative of Solaris, and is a core component of Mac OS X. DTrace allows you to do some very sophisticated things, including filtering the call stack running across every core and embedding references in dynamic languages such as JavaScript to reconcile address pointers to libraries and even to source code. Bryan demonstrated some wizardry analyzing a core dump of a crashed Node.js application, which was simply unbelievable. It means that the pains of troubleshooting stack traces of managed languages such as Python, Ruby and JavaScript are within reach (as long as you use Illumos). See Bryan's presentation here:

The Simian Army

So how do you deal with a business that is tied together with hundreds if not thousands of services, each having some combinatorial effect on the others, distributed in a way that makes understanding what is happening at any point in time almost impossible? Call in the monkeys!

Chaos Monkey

 “The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”[2]

What started as a methodology to literally create chaos, or the worst-case situation for operations, has turned into a methodology for coping with the unknown. Keeping customers happy requires understanding a range of complex interactions, from the impact of a simple server failure to more exotic memory corruption errors and race conditions.

As noted earlier, modern applications are designed to be both modularized and resilient in the face of catastrophe. In a similar vein to the way packet networks were designed to continue operating on the loss of a packet, modern application services are isolated from one another so as to avoid a mass cascading failure which would impact the entire service. In a technique similar to the way security researchers fuzz applications in search of memory corruption vulnerabilities, the Simian Army is put to work to challenge the best developers to keep their systems running in the face of utter chaos.
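The idea can be sketched in a few lines. This is a toy illustration only; the real Chaos Monkey terminates actual cloud instances in production, while the fleet and instance names below are made up.

```python
import random

# Toy Chaos Monkey: randomly "terminate" one instance from a fleet so
# the team can verify the service survives the loss.
def chaos_monkey(instances, rng=random):
    """Pick a random victim, mark it terminated, and return its id."""
    victim = rng.choice(list(instances))
    instances[victim] = "terminated"
    return victim

fleet = {"i-001": "running", "i-002": "running", "i-003": "running"}
victim = chaos_monkey(fleet)
survivors = [i for i, s in fleet.items() if s == "running"]
print(f"killed {victim}; survivors: {survivors}")
```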

Latency Monkey

Hard failures are one thing, but soft failures such as service degradation are harder to spot before users start to feel something is wrong. These monkeys inject random delays into client-side and server-side processing to test the ability of applications to detect and recover gracefully. These tests might expose more complex issues such as thundering herds and timeouts.
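A minimal sketch of the delay-injection idea follows. It is hypothetical; the real Latency Monkey operates inside Netflix's service infrastructure, while this version simply wraps a local function call.

```python
import random
import time

# Toy Latency Monkey: add a random delay in front of a call so client
# timeout and retry handling can be exercised under degradation.
def with_random_delay(fn, max_delay_s=0.05, rng=random):
    def wrapper(*args, **kwargs):
        time.sleep(rng.uniform(0, max_delay_s))  # injected latency
        return fn(*args, **kwargs)
    return wrapper

def fetch_profile(user_id):      # hypothetical downstream service call
    return {"id": user_id}

slow_fetch = with_random_delay(fetch_profile)
print(slow_fetch(7))  # same result as fetch_profile, unpredictable latency
```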

The Role of the Network

The network is a critical component, especially as designers gravitate towards distributed services. Whether you are dealing with two-phase commit or quorum-based consistency models, the network plays a major role in application design. Application architects must make clear tradeoffs when it comes to consistency and availability under network partitioning events, and must pay close attention to how network flow and error control impact the application.

During Facebook’s presentation they were careful to show the slide to the right when talking about dealing with scalability.

Incast[3][4] is a well-documented phenomenon, but solutions vary from engineering larger buffers to high-resolution TCP timers. The impact for applications with coordinated “synchronization” is a complete falloff in network throughput for hundreds of milliseconds. There is also another phenomenon called “port blackout” or “outcast”[5], where switches with tail-drop queues exhibit unfairness due to RTT bias. The effect is that short flows with small RTT can be starved by longer flows with larger RTT. Given the sacrifices that have to be made under these and other conditions, application developers struggle with prioritizing communications. For critical writes the network must have capacity to commit in a consistent way, while non-crucial services such as bookkeeping writes should be relegated to a best-effort service. Today application architects solve these problems in a number of ways, from moving to UDP and building error and flow control into the application layer, to providing rate limiters within the application API itself.
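One common shape for such an application-level rate limiter is a token bucket; here is a minimal sketch with illustrative parameters, keeping best-effort traffic from bursting past its allotment.

```python
import time

# Token-bucket rate limiter sketch: tokens refill at `rate` per second
# up to `capacity`; each request spends one token or is denied.
class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=2)
print([bucket.allow() for _ in range(4)])  # requests beyond the burst are denied
```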

Network designers try to solve these problems with larger buffers (which can cause instabilities of their own, because the error-control channel is caught up in data-transfer PDUs), with active queue management such as RED (although algorithmically flawed), and even with stochastic drop queues (i.e. random drop instead of tail-drop).
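RED's drop behavior can be sketched as a simple probability ramp. This is a simplification of the full algorithm, which tracks an exponentially weighted moving average of queue depth rather than the instantaneous value.

```python
# Sketch of RED (Random Early Detection): as average queue depth moves
# between a min and max threshold, packets are dropped with linearly
# increasing probability instead of waiting for the queue to tail-drop.
def red_drop_probability(avg_q, min_th, max_th, max_p):
    if avg_q < min_th:
        return 0.0          # queue healthy: never drop
    if avg_q >= max_th:
        return 1.0          # queue saturated: always drop
    # Linear ramp between the thresholds, capped at max_p.
    return max_p * (avg_q - min_th) / (max_th - min_th)

print(red_drop_probability(10, min_th=20, max_th=80, max_p=0.1))  # 0.0
print(red_drop_probability(50, min_th=20, max_th=80, max_p=0.1))  # 0.05
```

The early random drops signal congestion to a few TCP senders at a time, avoiding the synchronized global backoff that tail-drop induces.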


Application development continues to go through a massive amount of change. We have not discussed some of the practices around Agile and Lean software development, technical debt, and all of the work going into the client side such as HTML5 and CSS. We are seeing an increasing role in dealing with infrastructure at scale and redefining the proper models and patterns for continuously reengineering to grow the business. We all know that Moore's Law does not track the improvements in performance, and our view of time-shared systems doesn't reflect the combinatorial effects of O(N^2) application interactions. Hypervisor-based clouds are also becoming problematic for some. A great talk by Erik Onnen from Urban Airship covered the difficulties in maintaining their business on EC2, including:

  • Erratic network latency
  • Strange network behavior
  • Kernel and hypervisor issues
  • Undocumented limitations
  • Database scaling

This should not be a big surprise; after all, hypervisor-based systems are built on multi-tiered time sharing, which can be highly volatile, especially for data-intensive applications. Erik's presentation can be seen here:

As application developers learn to improve their use of memory and are given better interfaces to manage communications and quality of service, it may be possible to solve some of these challenges. Certainly there is no turning back: these problems need to be fully understood, and plans need to be in place, so businesses can continue to use the Internet to grow, put people to work to strengthen world economies, and allow innovation to transcend individual expectations to make a better world for tomorrow.


Forrester Views Cloud/Web is Outmoded and App-Internet is the new model

LeWeb 2011 George Colony, Forrester Research “Three Social Thunderstorms”

Over the past several years the word “cloud” has been used, and to some extent abused, almost to the point of meaninglessness. Every technology company, provider and enterprise is immersed in some sort of “cloud” project, although the exact descriptions of these projects may fall short of the NIST formal definitions. I think as technologists we tend to rebel against the status quo in an attempt not just to redefine the marketplace but also to claim for our own a new path as we iterate over the current challenges of delivering new applications and services.

Just as we have overused and bludgeoned the hell out of terms like internet, virtualization and web (the prior name for cloud), we are bound to move into a new set of vernacular definitions such as intercloud, interweb, fog computing, or, in the case of Forrester CEO George Colony, the App-Internet.

“Web and cloud are … outmoded,” concludes Mr. Colony, as he goes on to explain the App-Internet as the next model, offering a “faster, simpler, more immersive and a better experience”.

The thesis for this conclusion is based on the figure above, where the y-axis is defined as “utilities per dollar” and the x-axis is time. P is representative of Moore's Law and speaks to the scalability of processing power. In reality the beauty behind Moore's Law is lost in translation. What Moore really said was that transistors on a chip would double every year; subsequently David House, an Intel executive at the time, noted that the changes would cause computer performance to double every 18 months [1].

If you plot transistors per chip against actual computer performance, you might see a different picture, owing to the thermodynamic properties and manufacturing complexity of CMOS-based technology, not to mention the complexity of actually utilizing that hardware with today's languages, application methodologies, libraries and compilers.

S is for the growth in storage, which Colony calls “Hitachi's Law”. This predicts that storage will double approximately every 12 months. This also is somewhat contrived, as scaling magnetic media on disk is becoming extremely difficult as we approach the limits of perpendicular recording, although maybe there is some promise in the discovery of adding NaCl to the recording process[2]. Yes, we can build bigger houses with disks packed to the ceiling, but the logistics of managing such a facility are increasingly hitting their upper limits. (Imagine shuffling through a facility of over 100,000 sq ft and replacing all those failed hard drives.)

N is related to the network, where Colony goes on to describe the adoption rates of 3G vs. 4G. First and foremost, nailing down exactly what 4G is and means is an exercise in itself, as most vendors are implementing various technologies under this umbrella[3]. With an estimated 655 million people adopting 4G in its various forms by 2010[4] and the quick adoption of new mobile devices, I think this is a bit short-sighted.

But there is another aspect missing here: all of the towers that collect those 3G and 4G signals need to be back-hauled into the Internet backbone. With 40GE/100GE ratified in the IEEE, I suspect the first wave of 100GE deployments will be put into production in 2012 [5].

Colony goes on to say, “If your architecture was based on network you are wasting all of these improvements in processing and storage… the center (meaning the warehouse-scale datacenters such as Google, Amazon and Microsoft) is becoming more powerful and the periphery is becoming ever more powerful…”

His point is valid to an extent, but not because of the P, S, N curves; rather, now that the devices are so powerful AND we have such a robust network infrastructure, we can take advantage of all of this processing power and storage available to us. After all, if transport pricing had continued to rise as the late, great Jim Gray predicted in his paper on Distributed Computing Economics [7], we would not even be having this discussion, because without the ability to distribute data over the network, all we would have is some very smart, expensive devices that would essentially be fancy calculators.

To that point, Colony compares today's devices with their predecessors, but as stated earlier it is not a fair comparison. “In 1993 the iPad 2 would have been considered one of the 30 fastest computers in the world”. Unfortunately the problem space has changed since 1993, and if we follow Jevons Paradox, the proposition that technological progress which increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource[6], it would be hard to compare the two accurately.

So the reality is that all of these iterations, from the early ARPANET viewpoint of access to expensive time-sharing computer centers to the highly distributed and interconnected services we have today are just a succession of changes necessary to keep up with the demand for more information. Who knows what interesting changes will happen in the future but time and time again we have seen amazing strides taken to build communities and share our lives through technology.

So let's take a closer look at the App-Internet model.

Hmm. So how is this different from today's “web-centric” application architecture? After all, isn't a web browser like Chrome or Safari an “application”?

Jim Gray defined the ideal mobile task as one that is stateless (no database or data access), has tiny network input and output, and has a huge computational demand[7]. To be clear, his assumption was that transport pricing would rise enough to make the economics infeasible, but as we know the opposite happened: transport pricing has fallen.


“Most web and data processing applications are network or state intensive and are not economically viable as mobile applications.” Again, the assumptions he had about telecom pricing made this prediction incorrect. He also contended that “data loading and data scanning are cpu-intensive; but they are also data intensive and therefore are not economically viable as mobile applications”. The root of his conjecture was that “the break-even point is 10,000 instructions per byte of network traffic or about a minute of computation per MB of network traffic”.
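Gray's 10,000-instructions-per-byte figure can be sanity-checked with back-of-envelope arithmetic. The dollar figures below are illustrative assumptions in the spirit of his paper, not exact quotes from it.

```python
# Back-of-envelope check of Gray's break-even rule: shipping computation
# to remote data only pays off above a certain instructions-per-byte
# ratio. All three constants are illustrative assumptions.
NET_COST_PER_GB = 1.0          # assumed $/GB of WAN transfer
CPU_COST_PER_SEC = 1.0 / 3600  # assumed $ per CPU-second ($1/hour)
INSTRUCTIONS_PER_SEC = 3e9     # assumed ~3 GIPS processor

# Instructions purchasable for the price of moving one byte:
net_cost_per_byte = NET_COST_PER_GB / 1e9
breakeven = net_cost_per_byte / CPU_COST_PER_SEC * INSTRUCTIONS_PER_SEC
print(f"break-even: ~{breakeven:,.0f} instructions per byte")  # ~10,800
```

Under these assumed prices the break-even lands at roughly 10,800 instructions per byte, in the same ballpark as Gray's rule of thumb; as transport pricing fell, the break-even fell with it, which is exactly why his conclusion no longer holds.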

Clearly the economics and computing power has changed significantly in only a few short years. No wonder we see such paradigm shifts and restructuring of architectures and philosophies.

The fundamental characteristic which supports a “better experience” is defined as latency. We perceive latency as the responsiveness of an application to our interactions. So is he talking about the ability to process more information on intelligent edge devices? Does he not realize that a good portion of applications written for the web are built with JavaScript, and that advances in virtual machine technology like Google's V8 are what enable all of those highly immersive, fast-responding interactions? Even data loading and data scanning have improved through advances in AJAX programming and the emerging WebSockets protocol, which allows full-duplex communications between the browser and the server in a common serialization format such as JSON.

There will always be a tradeoff, however, especially as the data we consume is not our own but other people's. For instance, the beloved photo app in Facebook would never be possible using an edge-centric approach, as the data actually being consumed comes from someone else. There is no way to store n^2 information with all your friends on an edge device; it must be centralized to an extent.

For some applications, like gaming, we have a high sensitivity to latency, as the interactions are very time-dependent, both for the actions necessary to play the game and for how we take input for those actions through visual cues in the game itself. But if we look at examples such as OnLive, which allows lightweight endpoints to be used in highly immersive first-person gaming, clearly there is a huge dependency on the network. This is also the prescriptive approach behind Silk, although Colony discusses it in his App-Internet context. The reality is that the Silk browser is merely a renderer. All of the heavy lifting is done on Amazon's servers and delivered over a lightweight communications framework called SPDY.

Apple has clearly dominated, pushing all focus today onto mobile device development. The App-Internet model is nothing more than the realization that “applications” must be central to the model, something which the prior “cloud” and “web” terms didn't clearly articulate.

The Flash wars are over… or are they?

So what is the point of all of this App-Internet anyway? Well, the adoption of HTML5, CSS3, JavaScript, advanced libraries, code generation, etc. has clearly unified web development and propelled the interface into a close-to-native experience. There are, however, some inconsistencies in the model which allow Apple to stay just one step ahead with the look and feel of native applications. The reality is we have already been in this App-Internet model for some time now, ever since the first XHR (XMLHttpRequest) was embedded in a page with access to a high-performance JavaScript engine like V8.

So don't be fooled: without the network we would have no ability to distribute work and handle the massive amount of data being created and shared around the world. Locality is important until it's not… at least until someone builds a quantum computer network.

over and out…

  8. (Note: This is more representative as a trend rather than wholly accurate assessment of pricing)