To the land of plenty.. Moving towards high-performance cluster management


“Jevons Paradox” is the proposition that as technology progresses (invention->innovation->diffusion), the increase in efficiency with which a resource is used tends to increase the rate of consumption of that resource.

As much as cloud operators continue to increase the population of hardware assets, it has become increasingly difficult to utilize those resources efficiently as demand grows. This has huge implications for the longevity of these multi-million dollar cloud warehouses, highlighting the need to make better decisions on resource allocation and assignment.

Into the light..

Some promising work comes from Christina Delimitrou, described in her paper “Quasar: Resource-Efficient and QoS-Aware Cluster Management”.

Quasar is a follow-up to the work on Paragon, a system that leverages collaborative filtering to characterize (classify) applications in terms of heterogeneity and potential for interference. Quasar establishes a set of interfaces which expand upon Paragon’s classifier. These interfaces allow for choices to be made in scaling, such as the amount of resources per server or the number of servers per workload. Both Paragon and Quasar use offline sampling (profiling) instead of relying on explicitly declared characteristics, but Quasar goes further by jointly handling resource allocation and assignment. Quasar is part of a broader set of cluster management platforms, such as Omega, Borg and Mesos, which are being used in production at some of the largest web properties on the Internet.

Quasar exports a high-level interface to express different performance constraints, such as:

  • Latency critical workloads use a combination of QPS (Queries Per Second) and latency
  • Distributed frameworks use execution time
  • Single and Multi-threaded applications can use IPS (Instructions Per Second)
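To make the three kinds of targets a little more concrete, here is a minimal sketch in JavaScript of how such constraints could be declared and checked. The field names (`kind`, `qps`, `latencyMs`, `executionTimeSec`, `ips`) and the `meetsTarget` helper are my own hypothetical choices for illustration, not Quasar's actual interface.

```javascript
// Hypothetical performance targets, one per workload class listed above.
// None of these field names come from Quasar itself.
var targets = {
  webSearch: { kind: 'latency-critical', qps: 5000, latencyMs: 20 },
  batchJob: { kind: 'distributed-framework', executionTimeSec: 3600 },
  encoder: { kind: 'single-node', ips: 2e9 }
}

// Returns true when a measured sample satisfies the declared target.
function meetsTarget(target, measured) {
  switch (target.kind) {
    case 'latency-critical':
      // Must sustain the throughput AND stay under the latency bound.
      return measured.qps >= target.qps && measured.latencyMs <= target.latencyMs
    case 'distributed-framework':
      // Batch frameworks only care about finishing in time.
      return measured.executionTimeSec <= target.executionTimeSec
    case 'single-node':
      // Single- and multi-threaded applications are judged on instruction rate.
      return measured.ips >= target.ips
    default:
      throw new Error('unknown workload kind: ' + target.kind)
  }
}
```

The point of the sketch is that the developer states what performance is needed, not how many cores or servers to reserve.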

This work has a lot of promise given the increasing demand for efficient allocation of infrastructure resources. There continues to be an iterative cycle between application developers and infrastructure teams to mitigate the risk of failure while increasing utilization. But how does one decide which variables and how many must be used to decide on which resources to assign?

Large shops like Facebook, Twitter and Google have been experimenting with cluster scheduling for years. Systems like Omega grew out of the complexity of managing flexible scheduling with ever increasing linear complexity spawned from their explosive growth. As reported in the Quasar paper, sophisticated frameworks like Borg and Mesos have a hard time driving more than 20% aggregate CPU utilization, and can underestimate resource reservations by 5x and overestimate reservations by as much as 10x. It’s important to note that these numbers are at the high end, with a majority of cloud data centers and enterprise customers utilizing only a fraction of the capacity they have invested in.

As can be seen in the following graphic, not only are jobs completing faster with the Quasar scheduler, but CPU utilization is also higher, which could extend the useful life of a data center by several years, with dramatic cost savings for large web-scale data centers.


It is no secret in today’s “application centric economy” that huge benefits can be obtained through application/infrastructure cooperation. Chip designs have followed the path of adding transistors to deal with complex problems such as matrix multiplication, stream processing, virtualization and high-speed I/O. Infrastructure vendors have started to focus on the shifting operational models which have manifested in areas such as cloud computing, DevOps, Network Virtualization and Software Defined Networking.

The allocation and assignment of resources becomes a critical decision point which must be reacted to not at human scale but at machine scale. The dominant force here centers around “Reactive Design” and the need for operational stability.

But who is responsible for coordinating resources and resolving shared resource conflicts in a highly dynamic environment?

Send in the Conductor..

Orchestration describes the automated arrangement, coordination, and management of complex computer systems, middleware, and services that are used to align business or operational requests with applications, data and infrastructure within a management domain [ref].

Orchestration can be broken into roughly 9 categories including: Allocation, Assignment, Scheduling, Visualization, Monitoring, Modeling,  Discovery, Packaging and Deployment.

These become fundamental building blocks for building distributed systems and allow us to talk about these functions with a clear vocabulary.

Allocation: Determining the appropriate resources to satisfy the performance objective at the lowest cost

Assignment: The process of selecting the appropriate resources which satisfy the resource allocation

Scheduling: Enables an allocated resource to be configured automatically for application use, manages the resource(s) for the duration of the task to secure compliance and restores the resource to its original state for future use

Visualization: The process of rendering information related to service availability, performance and security

Monitoring: Provides visibility into the state of resources and notifies applications and infrastructure management services of changes in state

Discovery: The realization of a resource or service through observation, active probing or enrollment

Modeling: Describes available resources and their capabilities, dependencies, behaviors and relationships as a policy. Can also be used to describe composition of resources and services (i.e. happens-before relationships)

Packaging: The process of collecting all artifacts and dependencies into a portable container which can be transferred across resources. This packaging might also encapsulate existing state, for instance in live migrations.

Deployment: Code and data need to be instantiated into a system in order for the scheduler to reserve resources. Delivering the packages mentioned above across resources requires coordination so as not to overwhelm the network during updates.

When driving for high performance for customers and high efficiency for operators, resource allocation and assignment become critical decision processes in the orchestration system. Quasar provides an interface which relates directly to the emerging Promise Theory: developers declare scalability policies that express performance constraints, and Quasar searches the available option space to best fit those constraints to the available resources.
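The difference between allocation and assignment is easy to blur, so here is a rough sketch of the two steps as defined above. This is not how Quasar does it (Quasar relies on classification rather than the brute-force filter below); the `estimate` model, the cost function and the server records are all hypothetical.

```javascript
// Cost is simply total cores here; a real system would price more dimensions.
function cost(c) {
  return c.servers * c.coresPerServer
}

// Allocation: choose the cheapest candidate whose estimated performance
// meets the target. `estimate` is a caller-supplied (hypothetical) model.
function allocate(candidates, target, estimate) {
  var feasible = candidates.filter(function (c) {
    return estimate(c) >= target
  })
  if (feasible.length === 0) return null
  return feasible.reduce(function (best, c) {
    return cost(c) < cost(best) ? c : best
  })
}

// Assignment: pick concrete servers that can host the chosen allocation.
function assign(allocation, servers) {
  var chosen = servers
    .filter(function (s) { return s.freeCores >= allocation.coresPerServer })
    .slice(0, allocation.servers)
  if (chosen.length < allocation.servers) return null
  return chosen.map(function (s) { return s.name })
}
```

Allocation answers "how much", assignment answers "which ones". Handling them jointly means feeding what assignment learns (for example, that the only free servers are slow ones) back into the allocation decision.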

But what about the network?

“Your network is in my way..”


Everyone in the network industry is aware of James Hamilton’s observation that network technologies have long been inefficient and overly complex. SDN has driven this conversation to the forefront, challenging foundational principles of the Internet such as decentralization and the end-to-end principle. The current protocol stack has a number of problems known as far back as the initial ARPAnet designs over 40 years ago. The Internet has become more complex due to the distributed nature of application design and the need for location independence.

When it comes to network interference we have different opportunities to optimize for resource constraints including:

  • Path selection – Optimized to minimize distance (propagation delay)
  • Congestion and Flow Control – Optimized to maximize bandwidth
  • Error Control – Optimized to minimize loss
  • Scheduling – Optimized to maximize queue fairness amongst competing flows

This would seem to be plenty to deal with network interference, except for the problem that not all flows are necessarily equal. For instance, a trading application might need market data to take priority over backup replication. VoIP traffic needs to be prioritized over streaming downloads.
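As a toy illustration of "not all flows are equal", here is a sketch of a strict-priority scheduler in JavaScript. The `PriorityScheduler` name and the priority numbers are assumptions made for the example; real switches implement this in hardware queues, usually with some form of weighting.

```javascript
// Strict-priority scheduler sketch: a lower number means a higher priority.
function PriorityScheduler() {
  this.queues = {} // priority -> FIFO array of packets
}

PriorityScheduler.prototype.enqueue = function (priority, packet) {
  if (!this.queues[priority]) this.queues[priority] = []
  this.queues[priority].push(packet)
}

// Always serve the highest-priority non-empty queue first.
PriorityScheduler.prototype.dequeue = function () {
  var priorities = Object.keys(this.queues)
    .map(Number)
    .sort(function (a, b) { return a - b })
  for (var i = 0; i < priorities.length; i++) {
    var queue = this.queues[priorities[i]]
    if (queue.length > 0) return queue.shift()
  }
  return null // nothing waiting
}
```

With market data enqueued at priority 0 and backup replication at priority 2, every waiting market data packet leaves before any backup packet does. The obvious catch is starvation: a busy high-priority queue can hold back lower priorities indefinitely, which is one reason fairness shows up as a separate goal in the list above.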

Unfortunately as much as we would like to have a way to map priorities across the network, the current environment makes it difficult to achieve in practice. This usually falls within the purview of Traffic Engineering and incorporates different methods for describing, distributing and acting upon flow policies either for admission control or filtering.

In a recent GIGAOM survey operators categorized Network Optimization as the leading use-case for SDN, NFV and OpenSource which might be another way of saying that we need a facility to characterize inter-process communication in a way which can be fed back into our orchestration systems to make proper resource allocation and assignment decisions.


As the industry moves through technological change (S-Curve), a rapid innovation cycle will result in many failures until we reach the point of wide adoption (diffusion). Many have speculated on the timelines but it is still far from proven how well customers will adopt not only the change that comes from technology but also the change in organizational structure, skill sets and policy.

Stack Wars and the Rise of the SDN Republic

Recently there has been much insanity around the “SDN disruption” and the new “Stack Wars”. It’s time to sit back and look at what is going on.

The “Stack Wars” are in full swing, ensnaring AMZN, VMW, MSFT and GOOG. With the recent acquisition of Nicira by VMware, the “Platform Wars” have exploded into an all-out fight for the entire stack. VMware continues its dominance, with one survey suggesting a 24% lead over the next largest competitor, OpenStack. But VMware has yet to differentiate beyond the enterprise.

VMware recently fired another shot from their Death Star, publishing a new open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services. Cloud Foundry BOSH opens up the world of poly-cloud services. According to Steve Herrod’s latest post:

Cloud Foundry’s goal is to be the “Linux of the cloud.” Just as Linux provides a high degree of application portability across different hardware, Cloud Foundry provides a high level of application portability across different clouds and different cloud infrastructure. Steve Herrod, CTO VMware

So what about the EMC-created and Maritz-led Project Zephyr? Both Tucci and Maritz are tuned into the expanding market for infrastructure and platform services, expected to grow a combined $26.5B by 2016. They must start to build a reputation outside of the Enterprise and go after the same Consumer IT market Amazon has been so successful in capturing.

OpenStack and have yet to prove their scalability and operational robustness under fire (although RACK is desperate to make Essex a success). Others are following suit; recently RedHat finally pledged to the OpenStack initiative, but there are still major issues in governance and a fragile source base, which I feel still make it a questionable platform to build your business on.

We must not forget about the >1M server Google and the massively scalable Amazon coming downstream from Consumer IT into the Enterprise (AMZN certainly has been going after the enterprise, but not as a unification of public/private resource pools). If Larry Page and Jeff Bezos wise up, they will start to offer their orchestration and management tools for use within enterprises and expand into poly-cloud control. This can benefit their bottom line with a simple agreement of guaranteed public cloud usage, which can easily be justified based on today’s cloud sprawl. Having seamless access to secured and QoS-aware Enterprise resources along with the scalability and platform richness of public clouds will shift the power to one of these heavyweights, who might complete the “Death Star” and capture the “Linux of the Cloud” trophy.

SDN Disruption

In our core domain, there is a significant amount of confusion about what problems need to be solved and where. For instance, having a rich set of APIs to manage infrastructure is simply a matter of economies of scale. Without lowering the average cost per unit (in terms of operations, robustness, flexibility, etc.) by means of automation, you are simply carrying an anchor around all day, slowing your business. But is this SDN?

VMware has moved into the world of SDN through the acquisition of Nicira. VMware has been successful at virtualizing compute, storage and now networking. This can be considered the trifecta necessary to capture the “control points”, enabling them to be first in developing a unified Abstract Binary Interface to all infrastructure components. Those of you familiar with the Linux Standard Base or Single Unix Specification will recognize why this is extremely valuable in building the cloud operating system.

Each of their control points provides added value to build upon, for instance the VM association to location, policy, metrics, QoS guarantees, etc. These are incredibly valuable, as is the network binding (mac->port->IP). With this information the owner can control any resource anywhere, regardless of the network, operating system or hypervisor.

Value of the Physical Network

So what about the role of the physical network? We have heard many leaders discuss this in the context of commoditized switches, merchant silicon and proprietary fabrics.

There are significant challenges in optimizing networks, especially in data centers which require a mixed set of services and tenants, such as Unified Multi-Service Data Centers. There is a need for efficient topologies which maximize bisection bandwidth, reduce the overhead in cabling and reduce operational complexity. The network fabric should work as well for benign traffic as it does for permutation traffic (i.e. many-to-one scenarios familiar from partition-aggregate application patterns).

If I can’t utilize the full capacity of the network and be assured that I have properly scheduled workloads during permutation traffic interactions, then certainly the physical network becomes an increasingly important design point. This requires changes to topology, flow control, routing and possibly the protocol architecture in order to arbitrate amongst the competing flows while maintaining low variance in delay and robustness to failures.

Realistically, the shift to 10GE network fabrics and host ports provides better scalability, and to date application designers have yet to fully exploit distributed processing, which means the data center traffic matrix is still fairly sparse.

As we move into the future and workloads become more dense, one could argue that the physical network has a lot more it can accomplish. For instance, ALL Fat-Tree architectures limit the available capacity of the network to the min-cut bisection bandwidth. This means that overall throughput is limited to 50% of total capacity (note: that is an ideal throughput, because routing and flow control limit capacity further). The question for data center designers is: will you pay for a network of which you can only utilize a subset of 50% of the capacity you purchased? I certainly would be looking for options that would improve my cost model, and this is an area where we haven’t yet found the secret sauce.

The reality is that the networks we architect today are far more efficient and offer more capacity than ever before. Load-dependent bottlenecks show up well before you can exhaust the resources of the network, which basically supports the argument for network virtualization: reduce the amount of churn (i.e. state management) in the physical network, allowing it to be more robust, reliable and predictable.

SDN to the Rescue?

The main problem today is the exhaustive manual effort in configuring all of the dependencies, dials and protocols, and having to think about how things physically lay down together, from the wiring to the VLAN associations, security policies, etc. This has become too cumbersome and impossible to reason about, which is why overlays look so attractive. You can no longer codify or teach the network on a whiteboard; even representing all the different configuration noise in Visio diagrams is extremely complex, and you still can’t reason holistically about network continuity, security and access control.

Reasoning about the network and applying formal verification testing before changes will allow networks to be much more predictable, with much less complexity and fewer failures. Today’s switches and routers, which require knowledge of complex data structures, different algorithmic complexities and interrelated dependencies, cause a chain reaction of combinatorial issues. Between the link layer and inter-domain routing there are many interactions which can go haywire, and current techniques like static analysis don’t cover the quadratic state explosion problem which exists in today’s infrastructure software.

As far as SDN goes, it’s only a matter of time before the TCAM manufacturers catch up to the requirements being forged in the ONF. Nick McKeown made a point in his SIGCOMM 2012 keynote that in a few short years we will have power-efficient TCAMs with hundreds of thousands of entries and multiple table support. Given that this is the primary bottleneck to completing the SDN ABI, we will most likely see SDN become a very strong alternative to today’s mixed bag of control plane protocols. To be honest, rightly so. This is not necessarily our fault but an artifact of the flawed protocol model developed at a time when getting a character across the screen on a terminal was considered a huge step forward. This is most certainly not the world we live in anymore, and unfortunately the specializations which have been built up to deal with this model are quickly being challenged.

Network Mechanics to Network Conductors

I get a general sense that the latest incarnation of network evolution (i.e. SDN) is becoming a way of expressing the frustration with dealing with a complex set of problems, which have yet to be solved. One of the things you have to ask yourself, as a network professional is “What do I really understand about the fundamentals of networking and how do I put that to use in the post-PC data hungry world?”

For years the best way to understand networking was to lug out your Network General Sniffer and watch the interaction of messages flowing across the screen. We had basic signals such as connection management; we had a general understanding of the traffic matrix by interrogating the network addresses, which we compared to our spreadsheets and some heuristics about flows. We leveraged the emerging SNMP standard to first collect traffic statistics into our pre-RRD datastores and presented pretty graphs of utilization to understand demand. Soon we had some expert systems, which would track the various protocol states transpiring between hosts and interpret the results.

Scaling the data center meant learning about aggregation/distribution and the ratio of local to remote traffic. At the time most network engineers were taught the 80/20 rule, i.e. 80% of the traffic stays local and only 20% is remote. This was a direct play on our centralized compute models, mainframes and the fact that most people were still using terminal based computing and sneakernet. It became the foundation of network design, which reflected this by oversubscribing capacity higher in the tree (i.e. Core, Distribution, Edge design).
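The arithmetic behind the 80/20 rule is worth spelling out. Here is a small sketch with made-up port counts and speeds (the 48 x 1 Gb/s edge switch with a single 10 Gb/s uplink is purely hypothetical) showing why heavy oversubscription was tolerable while most traffic stayed local.

```javascript
// Oversubscription ratio of a switch tier: host-facing capacity divided by
// uplink capacity (both in Gb/s).
function oversubscription(hostPorts, hostGbps, uplinks, uplinkGbps) {
  return (hostPorts * hostGbps) / (uplinks * uplinkGbps)
}

// Worst-case load offered to the uplinks if every host port runs at line
// rate and `remoteFraction` of that traffic leaves the local segment.
function uplinkLoadGbps(hostPorts, hostGbps, remoteFraction) {
  return hostPorts * hostGbps * remoteFraction
}

// Hypothetical edge switch: 48 x 1 Gb/s host ports, 1 x 10 Gb/s uplink.
var ratio = oversubscription(48, 1, 1, 10) // 4.8, i.e. 4.8:1
var load = uplinkLoadGbps(48, 1, 0.2)      // about 9.6 Gb/s offered to a 10 Gb/s uplink
```

With only 20% of traffic leaving the segment, a 4.8:1 design just about fits. Flip the ratio so that 80% is remote and the same switch offers roughly 38.4 Gb/s to that 10 Gb/s uplink.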

Network automation was still in its infancy; you would use a floppy disk to update the firmware and operating system. Upgrading a Cisco router meant getting your terminal configured with the appropriate Xmodem/Zmodem settings and waiting hours while your data was serialized down a modem from the Cisco CCO BBS site.

Soon we were leveraging scripting languages like Expect and Perl to handle the complexity of managing network state across all the configuration files. Once you could use the SNMP private MIB to read and write a device configuration, you could make global changes in an instant and repopulate the configurations across the world. In some ways this was all a step backwards from the advanced telecommunications control systems of the day; it was still a very closed and proprietary world, leaving customers no choice but to adopt some complex and monolithic management applications.

So it’s 2012 and we are not much better at dealing with all of the challenges in running such a complex system as the network. The IETF finally got its act together and delivered a more robust management framework through an application protocol called NETCONF and an information modeling definition called YANG. Finally you can divorce the information model from the data transfer protocol and allow for a cleaner representation of the network configuration. But is this as far as we need to go? Why is SDN so interesting, and what is it telling us about the still very complex problems with building and operating networks?

As the title of the blog suggests, I think something can be said for the expertise required to manage complex systems. The question becomes: are you going to remain a mechanic, worrying about low-level details, or are you going to be the pilot? Is it valuable to your employer for you to understand the low-level semantics of a specific implementation, or to rise above that by creating proper, reusable interfaces to manipulate the state of the network?

With information becoming more valuable than most commodities it will take a shift in mindset to move from low-level addressing concerns to traffic analysis, modeling and control. Understanding where the most important data is, how to connect to it and avoid interference will become much more important than understanding protocols.

So how does SDN contribute to this and how do we get from the complex set of tasks of setting up and operating networks to more of a fly-by-wire approach? How do we go from managing a huge set of dials and instruments to managing resources like a symphony?

The first thing to recognize is you can’t solve this problem in the network by itself! For years application developers’ expectations of the network were of infinite capacity and zero latency. They perceived that the flow-control capability in the network would suffice, giving them ample room to pummel the network with data. Locality was not even an afterthought, because they were developing on local machines, unaware of the impact of crossing network boundaries. Networking guys use terms like latency, jitter, bandwidth, over-subscription, congestion, broadcast storms and flooding, while application developers talk in terms of consistency, user experience, accuracy and availability.

The second thing to recognize is the network might need to be stripped down and built back up from scratch in order to further deal with its scaling challenges. In my eyes this is the clearest benefit of SDN, as it highlights some of the major challenges in building and running networks. Experimenting with a complex system is disastrous; in order to break new ground it must be decomposed into its simplest form, but certainly no simpler, as Einstein would say. It’s possible that OpenFlow has gone this route and must be redesigned into a workable set of primitive functions which can be leveraged not just through a centralized controller model but also to adapt new Operating Systems and protocols to leverage the hardware.

There is much debate over what the “best” model is here and what the objectives are. Since most networking is basically a “craft” and not a science there are those who strive to maintain the existing methodologies and mechanisms and simply open up a generalized interface to improve control. Others might see this as a mistake as if you reproduce the current broken layering model you are bound to run into a new set of challenges down the line which may require another patch, protocol or fix to solve.

Maybe an approach of looking back at the fundamentals of networking, what has been learned through the course of history, how other protocols behave and a reflective look at our industry would be valuable. How do you deal properly with connection management, data transfer efficiency, flow control? How do you leverage proper encapsulations and hierarchy to scale efficiently? What should management look like and how do you separate mechanism from policy and deliver hop-by-hop QOS?


In some regards the move towards Software Defined Networking is an outcry of frustration at managing an ever more complex set of interrelated components. Data centers have become huge information factories; servers themselves have become clusters of computers, and our data-hungry applications require massive amounts of parallel computing, driving even more demand into the network. We could continue to take an ill-suited feature-driven approach to networking, or we could take the opportunity to recognize the architectural principles of networking which would turn it from a craft into a science (notwithstanding the argument on true science).

NodeFlow: An OpenFlow Controller Node Style

Unless you’ve been under a rock lately, you have probably heard something about Software Defined Networks, OpenFlow, Network Virtualization and Control Plane/Data Plane separation.

Some of the reasons for the interest might be:

  • Evolution of the system architecture as a whole (Network, NIC, PCIE, QPI, CPU, Memory) along with X86_64 instructions, OS, drivers, software and applications has allowed many services to run on a single host, including network services. Extending the network domain into the host allows for customizable tagging, classification, load balancing and routing, with the utopia being ubiquitous control of logical and physical by a combination of in-protocol state, forwarding tables and a distributed control system.
  • Non-experimental network pathologies, which are causing havoc with large-scale systems. It turns out there are some very “real” problems which were never part of the Ethernet and TCP/IP design space, and software allows us to experiment with different ideas on how to solve these problems.
  • Leveraging a possibly untapped design space in order to differentiate, leapfrog the competition or disrupt the marketplace

So what is OpenFlow? Well according to the Open Networking Foundation:

“OpenFlow is an open standard that enables researchers to run experimental protocols in the campus networks we use every day”

This paradigm shift into the guts of the network might be better explained by a surgical assessment of the network core, its protocol structure, the devices, which deal with enrollment, classification, multiplexing/demultiplexing, flow control and routing but this will be a post for another day.

In the meantime the “network” has evolved into a first class citizen amongst infrastructure architects, software developers and consumers alike. No, I am not talking about the Social Network by big boy Zuck, but the fact that networks are finding themselves ingrained in almost anything not nailed down. This so called “Internet of Things” tells us that soon the network will be stitched into our lives through the air and into our clothes.

There are many arguments about the value of OpenFlow and SDN, but to find the benefits and use-cases the network domain experts may find the current toolsets and platforms a bit impenetrable. The current controller implementations are written in a combination of C, Python and Java and because of the “asynchronous” nature of the OF protocol, additional libraries have to be leveraged including Twisted and NIO which make it more difficult to understand exactly what is going on.

To that end I introduce NodeFlow, an OpenFlow controller written in pure JavaScript for Node.JS.  Node.JS provides an asynchronous library over JavaScript for server side programming which is perfect for writing network based applications (ones that don’t require an excessive amount of CPU).

NodeFlow is actually a very simple program and relies heavily on a protocol interpreter called OFLIB-NODE written by Zoltan Lajos Kis. I have a forked version of this library (see below) which has been tested with OpenFlow version 1.0.

Sidebar: A note on OpenFlow

Even though the Open Networking Foundation has ratified the 1.2 protocol specification, we have yet to see a reference design which allows developers to experiment. In order to get a grasp of the programming model and data structures, I have concentrated on the most common implementation, OpenFlow 1.0 in OpenVSwitch.

Sidebar: Why Node.JS

Node.JS has become one of the most watched repos in GitHub and is headed up by the brilliant guys at Joyent. Anyone interested should check out Bryan Cantrill’s presentation  Building a Real-Time Cloud Analytics Service with Node.js

Setting up the development environment

Leveraging OpenVSwitch and tools such as MiniNet, anyone can create a simulated network environment on their own local machine. Instructions on how to set up the development environment can be found here: Download and Get Started with Mininet

Code review

We first set up the network server with a simple call to net.createServer, then tell it which port and address to listen on. The address and port are configured through a separate start script.

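var net = require('net')
var util = require('util')

NodeFlowServer.prototype.start = function(address, port) {
    var self = this

    var server = net.createServer()

    // The listen callback fires once the server is bound; it takes no arguments.
    server.listen(port, address, function() {
        util.log("NodeFlow Controller listening on " + address + ':' + port)
        self.emit('started', { "Config": server.address() })
    })

    // The connection handling shown in the next excerpt is registered here.
}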

The next step provides the event listeners for socket maintenance, creates a unique sessionID with which we can keep track of each switch connection, and sets up our main event processing loop, which is called every time we receive data on our socket channel. We use a stream library to buffer the data and return the decoded OpenFlow messages in the msgs object. We make a simple check on the message structure and then pass it on for further processing.

server.on('connection', function(socket) {
    socket.setNoDelay(true) // disable Nagle's algorithm so small messages go out immediately
    // One session per switch connection, keyed by the remote address and port
    var sessionID = socket.remoteAddress + ":" + socket.remotePort
    sessions[sessionID] = new sessionKeeper(socket)
    util.log("Connection from : " + sessionID)

    socket.on('data', function(data) {
        // Buffer the raw bytes and get back any complete, decoded OpenFlow messages
        var msgs = switchStream.process(data)
        msgs.forEach(function(msg) {
            if (msg.hasOwnProperty('message')) {
                self._processMessage(msg, sessionID)
            } else {
                util.log('Error: Message is unparseable')
            }
        })
    })
})
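The buffering step deserves a closer look, because TCP hands you arbitrary chunks rather than whole OpenFlow messages. Below is a minimal sketch of the framing idea. It is not the stream code from OFLIB-NODE; the `OpenFlowFramer` name and the shape of the returned objects are hypothetical. The only thing it relies on is the fixed 8-byte OpenFlow header: version (1 byte), type (1 byte), length (2 bytes) and xid (4 bytes), in network byte order, where length covers the header plus the body.

```javascript
// Minimal OpenFlow framer: buffers TCP chunks and yields complete messages.
function OpenFlowFramer() {
  this.pending = Buffer.alloc(0) // bytes received but not yet consumed
}

OpenFlowFramer.prototype.process = function (chunk) {
  this.pending = Buffer.concat([this.pending, chunk])
  var msgs = []
  // Keep extracting messages while a full header is available.
  while (this.pending.length >= 8) {
    var length = this.pending.readUInt16BE(2)
    if (length < 8) throw new Error('invalid OpenFlow length: ' + length)
    if (this.pending.length < length) break // wait for the rest of the message
    msgs.push({
      version: this.pending.readUInt8(0),
      type: this.pending.readUInt8(1),
      xid: this.pending.readUInt32BE(4),
      body: this.pending.subarray(8, length)
    })
    this.pending = this.pending.subarray(length)
  }
  return msgs
}
```

Each call returns zero or more complete messages and keeps any trailing partial message for the next chunk, which is why the data handler above iterates over an array instead of assuming one message per event.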

In the last section we leverage Node.JS EventEmitters to trigger our logic using anonymous callbacks. These event handlers wait for a specific event to happen and then trigger processing. The events handled in this initial release include ‘OFPT_PACKET_IN’, which is the main event to listen on for PACKET_IN messages, and ‘SENDPACKET’, which simply encodes and sends our OF message on the wire.

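self.on('OFPT_PACKET_IN', function(obj) {
    // The first argument was lost from the original listing; the raw frame
    // carried in the PACKET_IN body (obj.message.body.data) is assumed here.
    var packet = decode.decodeethernet(obj.message.body.data, 0)
    nfutils.do_l2_learning(obj, packet)
    self._forward_l2_packet(obj, packet)
})

self.on('SENDPACKET', function(obj) {
    nfutils.sendPacket(obj.type, obj.packet.outmessage, obj.packet.sessionID)
})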
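If the EventEmitter pattern is new to you, here is a stripped-down sketch of the same chain using nothing but Node's built-in events module. The payload shapes and the `sent` array (standing in for the socket) are made up for the illustration and are not NodeFlow's actual objects.

```javascript
var EventEmitter = require('events').EventEmitter

var controller = new EventEmitter()
var sent = [] // stands in for the wire in this sketch

// Decision logic: react to a PACKET_IN by asking for a packet to be sent.
controller.on('OFPT_PACKET_IN', function (obj) {
  controller.emit('SENDPACKET', {
    type: 'OFPT_PACKET_OUT',
    packet: { sessionID: obj.sessionID, outmessage: { in_port: obj.in_port } }
  })
})

// I/O logic: the only handler that touches the "socket".
controller.on('SENDPACKET', function (obj) {
  sent.push(obj)
})

// Simulate a switch reporting a packet; emit() runs listeners synchronously.
controller.emit('OFPT_PACKET_IN', { sessionID: '10.0.0.1:6633', in_port: 1 })
```

Keeping the decision in one handler and the I/O in another means the forwarding logic can be exercised without a switch attached.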

The “Hello World” of OpenFlow controllers simply provides a learning bridge function. Below is the implementation, which is fundamentally a port of the Python-based NOX Pyswitch.

do_l2_learning: function(obj, packet) {
    self = this

    var dl_src = packet.shost
    var dl_dst = packet.dhost
    var in_port = obj.message.body.in_port
    var dpid = obj.dpid

    // Never learn the broadcast address as a source
    if (dl_src == 'ff:ff:ff:ff:ff:ff') {
        return
    }

    // First packet from this switch: create its MAC table
    if (!l2table.hasOwnProperty(dpid)) {
        l2table[dpid] = new Object() //create object
    }
    if (l2table[dpid].hasOwnProperty(dl_src)) {
        var dst = l2table[dpid][dl_src]
        if (dst != in_port) {
            util.log("MAC has moved from " + dst + " to " + in_port)
        } else {
            return
        }
    } else {
        util.log("learned mac " + dl_src + " port : " + in_port)
        l2table[dpid][dl_src] = in_port
    }
    if (debug) {
        // debug dump of l2table goes here
    }
}
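Learning is only half of a bridge; the other half is deciding where to send the frame. The forwarding function is not shown in this post, so the following is only a sketch of the usual decision under my own assumptions. `lookupOutPort` is a hypothetical helper, the table layout mirrors `l2table[dpid][mac] = port` from above, and dropping a frame whose destination sits behind the ingress port is one common choice rather than necessarily what NodeFlow does.

```javascript
// Hypothetical helper, for illustration only.
var OFPP_FLOOD = 'OFPP_FLOOD'   // symbolic flood port name from OpenFlow 1.0

function lookupOutPort(l2table, dpid, dl_dst, in_port) {
    var ports = l2table[dpid]
    if (!ports || !Object.prototype.hasOwnProperty.call(ports, dl_dst)) {
        return OFPP_FLOOD    // unknown switch or destination: flood
    }
    var out_port = ports[dl_dst]
    if (out_port === in_port) {
        return null          // destination is behind the ingress port: drop
    }
    return out_port          // known destination: forward on the learned port
}
```

A known destination returns its learned port, which is the point at which a controller would install a flow so that later packets never leave the switch.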


Alright, so seriously, why the big deal? There are other implementations which do the same thing, so why is NodeFlow so interesting? Well, if we look at setting up a Flow Modification, which is what gets instantiated in the switch forwarding table, we can see every element in JSON notation thanks to the OFLIB-NODE library. This is very important, as deciphering the TLV-based protocol from a normative reference can be dizzying at best.

setFlowModPacket: function(obj, packet, in_port, out_port) {

    var dl_dst = packet.dhost
    var dl_src = packet.shost
    var flow = self.extractFlow(packet)

    flow.in_port = in_port

    return {
        message: {
            version: 0x01,
            header: {
                type: 'OFPT_FLOW_MOD',
                xid: obj.message.header.xid
            },
            body: {
                command: 'OFPFC_ADD',
                hard_timeout: 0,
                idle_timeout: 100,
                priority: 0x8000,
                buffer_id: obj.message.body.buffer_id,
                out_port: 'OFPP_NONE',
                flags: ['OFPFF_SEND_FLOW_REM'],
                match: {
                    header: {
                        type: 'OFPMT_STANDARD'
                    },
                    body: {
                        'wildcards': 0,
                        'in_port': flow.in_port,
                        'dl_src': flow.dl_src,
                        'dl_dst': flow.dl_dst,
                        'dl_vlan': flow.dl_vlan,
                        'dl_vlan_pcp': flow.dl_vlan_pcp,
                        'dl_type': flow.dl_type,
                        'nw_proto': flow.nw_proto,
                        'nw_src': flow.nw_src,
                        'nw_dst': flow.nw_dst,
                        'tp_src': flow.tp_src,
                        'tp_dst': flow.tp_dst
                    }
                },
                actions: {
                    header: {
                        type: 'OFPAT_OUTPUT'
                    },
                    body: {
                        port: out_port
                    }
                }
            }
        }
    }
}


Performance and Benchmarking

So I used Cbench to compare NOX vs. NodeFlow and here are the results.

NOX [./nox_core -i ptcp: pytutorial]

NOX c++ [./nox_core -i ptcp: switch]:

NodeFlow [running with Debug: False]:

C based Controller:

As you can see from the numbers, NodeFlow can handle almost 2X what NOX can do and is much more deterministic. Maxing out at 4600 rsp/sec is not too shabby on a VirtualBox VM on my MacBook Air!


At just under 500 LOC, this prototype implementation of an OF controller is orders of magnitude smaller than comparable systems. Leveraging JavaScript and the high-performance V8 engine allows network architects to experiment with various SDN features without having to deal with all of the boilerplate code required for setting up event-driven programming. I hope someone gets inspired by this and takes a closer look at Node.JS for network programming.

So how do I get NodeFlow?

NodeFlow is an experimental system available at GitHub: git:// along with my fork of the OFLIB-NODE libraries here: git:// If you would like to contribute or have any questions please contact me via Twitter @gbatcisco

Special thanks to Zoltan LaJos Kis for his great OFLIB-NODE library, without which this work couldn’t have been done, and to Matthew Ranney for his network decoder library node-pcap.

Cisco UCS “Cloud In A Box”: Terabyte Processing In RealTime

Now I hate using the term “Cloud” for anything these days, but in the latest blog entry from Shay Hassidim, Deputy CTO of Gigaspaces, “Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2”, the Cisco UCS 260 took the place of 16 Amazon High-Memory Quadruple Extra Large Instances. With 16:1 scaling, imagine what you can do with a rack of these. In other words, forget about Hadoop; let’s go real-time, data-grid-enabled networking!

With 40 Intel cores and 1TB of memory available to the Gigaspaces XAP high-performance In-Memory Data Grid, the system achieved an astounding 500,000 Ops/sec on 1024B POJOs and could load 1 billion objects in just under 36 minutes.
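A quick back-of-the-envelope check shows how these figures hang together. The inputs are the numbers quoted above; the variable names are mine, and I am assuming the 1024B figure is raw payload per object with no per-object overhead.

```javascript
// Figures quoted in the post; everything else is plain arithmetic.
var objects = 1e9          // 1 billion objects
var opsPerSec = 500000     // quoted peak rate
var objectBytes = 1024     // 1024B POJO, assumed to be raw payload only
var loadMinutes = 36       // "just under 36 minutes"

var minutesAtPeak = objects / opsPerSec / 60               // if the peak rate were sustained
var sustainedRate = objects / (loadMinutes * 60)           // rate implied by a 36-minute load
var payloadTiB = objects * objectBytes / Math.pow(2, 40)   // raw payload size

console.log(minutesAtPeak.toFixed(1) + ' min at peak rate')        // 33.3 min at peak rate
console.log(Math.round(sustainedRate) + ' objects/sec sustained')  // 462963 objects/sec sustained
console.log(payloadTiB.toFixed(2) + ' TiB of raw payload')         // 0.93 TiB of raw payload
```

In other words, a 36-minute load means the system held over 90% of its peak rate for the whole run, and the raw payload alone comes close to filling the 1TB of memory.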

Now this might not sound extraordinary, but consider what it takes to build an application whose bottleneck on a 40-core, 1TB system is CPU and memory, that properly deals with failures, and that has automation and instrumentation: you can’t beat this kind of system. Gigaspaces is also integrated with the Cisco UCS XML-API for dynamic scaling of hardware resources.

Eventually people will catch on that memory is critical for dealing with “Big Data” and that it’s no longer an issue of reliability or cost. Without disk rotational latency and poor random access in the way, we can push the limits of our compute assets while leveraging the network for scale. Eventually we might see a fusion of in-memory data grids with the network in a way that allows us to deal with permutation traffic patterns by changing the dynamics of networking, processing and storage.

Forrester Views Cloud/Web is Outmoded and App-Internet is the new model

LeWeb 2011 George Colony, Forrester Research “Three Social Thunderstorms”

Over the past several years the word ‘Cloud’ has been used, and to some extent abused, almost to the point of being superfluous. Every technology company, provider and enterprise is immersed in some sort of “cloud” project, although the exact descriptions of these projects may fall short of the NIST formal definitions. I think as technologists we tend to rebel against the status quo in an attempt not just to redefine the marketplace but also to claim a new path for our own as we iterate over the current challenges of delivering new applications and services.

Just as we have overused and bludgeoned the hell out of terms like internet, virtualization and web (the prior name for cloud), we are bound to move into a new set of vernacular definitions such as intercloud, interweb, fog computing or, in the case of Forrester CEO George Colony, App-Internet.

“Web and cloud are .. outmoded” concludes Mr. Colony as he goes on to explain the App-Internet as the next model offering a “faster, simpler, more immersive and a better experience”.

The thesis for this conclusion is based on the figure above, where the y-axis is defined as “utilities per dollar” and the x-axis is time. P is representative of “Moore’s Law” and speaks to the scalability of processing power. In reality the beauty behind Moore’s Law is lost in translation. What Moore really said was “transistors on a chip would double every year”, and subsequently David House, an Intel executive at the time, noted that the changes would cause computer performance to double every 18 months [1].

If you plot transistors per chip against actual computer performance you might see a different picture, due to the thermodynamic properties and manufacturing complexity of CMOS-based technology, not to mention the complexity of actually utilizing that hardware with today’s languages, application methodologies, libraries and compilers.

S is for the growth in storage, which Colony calls “Hitachi’s Law”. This predicts that storage will double approximately every 12 months. This is also somewhat contrived, as scaling magnetic media on disk is becoming extremely difficult as we approach the limits of perpendicular recording, although maybe there is some promise with the discovery of adding NaCl to the recording process [2]. Yes, we can build bigger houses with disks packed to the ceiling, but the logistics of managing such a facility are increasingly hitting their upper limits. (Imagine shuffling through a facility of over 100,000 sq ft and replacing all those failed hard drives.)

N is related to the network, where Colony goes on to describe the adoption rates of 3G vs 4G. First and foremost, nailing down exactly what 4G is and means is an exercise in itself, as most vendors are implementing various technologies under this umbrella [3]. With an estimated 655 million people adopting 4G in its various forms by 2010 [4] and the quick adoption of new mobile devices, I think this is a bit short-sighted.

But there is another aspect missing from this: all of the towers that collect those 3G and 4G signals need to be back-hauled into the Internet backbone. With 40GE/100GE ratified in the IEEE, I expect the first wave of 100GE deployments to be put into production in 2012 [5].

Colony goes on to say “If your architecture was based on network you are wasting all of these improvements in processing and storage.. the center (meaning the warehouse scale datacenters such as Google, Amazon and Microsoft) is becoming more powerful and the periphery is becoming ever more powerful…”

His point is valid to an extent, not because of the P, S, N curves but because, now that the devices are so powerful AND we have such a robust network infrastructure, we can take advantage of all of the processing power and storage available to us. After all, if transport pricing had continued to rise as the late great Jim Gray predicted in his paper on Distributed Computing Economics [7], we would not even be having this discussion, because without the ability to distribute data over the network all we would have are some very smart, expensive devices that would essentially be fancy calculators.

To that point Colony compares today’s devices with their predecessors, but as stated earlier it’s not a fair comparison. “In 1993 the iPad 2 would have been considered one of the 30 fastest computers in the world”. Unfortunately the problem space has changed since 1993, and if we follow Parkinson’s corollary called “Jevons Paradox”, the proposition that technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource [6], it would be hard to compare these two accurately.

So the reality is that all of these iterations, from the early ARPANET viewpoint of access to expensive time-sharing computer centers to the highly distributed and interconnected services we have today are just a succession of changes necessary to keep up with the demand for more information. Who knows what interesting changes will happen in the future but time and time again we have seen amazing strides taken to build communities and share our lives through technology.

So let’s take a closer look at the App-Internet model.

Hmm. So how is this different from today’s “Web-Centric” application architecture? After all, isn’t a web browser like Chrome or Safari an “application”?

Jim Gray defined the ideal mobile task as one that is stateless (no database or data access), has a tiny network input and output, and has a huge computational demand [7]. To be clear, his assumption was that transport pricing would keep rising and make the economics infeasible, but as we know the opposite happened and transport pricing has fallen.


“Most web and data processing applications are network or state intensive and are not economically viable as mobile applications.” Again, the assumptions he had about telecom pricing made this prediction incorrect. He also contended that “Data loading and data scanning are cpu-intensive; but they are also data intensive and therefore are not economically viable as mobile applications.” The root of his conjecture was that “the break-even point is 10,000 instructions per byte of network traffic or about a minute of computation per MB of network traffic”.
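Gray's rule of thumb is easy to express as a ratio test. The sketch below uses only the 10,000 instructions-per-byte figure quoted above; `isWorthShipping` and its optional `breakEven` parameter are hypothetical names I introduced so the threshold can be lowered to reflect cheaper transport.

```javascript
// Break-even quoted from Gray: 10,000 instructions per byte of network traffic.
var GRAY_BREAK_EVEN = 10000

// Hypothetical helper: does a task do enough computation per byte moved
// to justify shipping it over the network?
function isWorthShipping(instructions, networkBytes, breakEven) {
    var threshold = breakEven || GRAY_BREAK_EVEN
    if (networkBytes === 0) {
        return true   // nothing to move, so transport cost is not a factor
    }
    return instructions / networkBytes >= threshold
}

// A job doing 10^9 instructions over 1 MB of traffic fails Gray's test...
console.log(isWorthShipping(1e9, 1e6))        // false (1,000 instructions per byte)
// ...but passes once the break-even drops by two orders of magnitude.
console.log(isWorthShipping(1e9, 1e6, 100))   // true
```

The second call is the whole argument in miniature: the workload did not change, only the price of moving bytes did.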

Clearly the economics and computing power have changed significantly in only a few short years. No wonder we see such paradigm shifts and restructuring of architectures and philosophies.

The fundamental characteristic which supports a “better experience” is defined as latency. We perceive latency as the responsiveness of an application to our interactions. So is he talking about the ability to process more information on intelligent edge devices? Does he not realize that a good portion of applications written for the web are built with JavaScript, and that advances in virtual machine technology like Google V8 are what enable all of those highly immersive and fast-responding interactions? Even data loading and data scanning have improved through advances in AJAX programming and the emerging WebSockets protocol, which allows full-duplex communication between the browser and the server in a common serialization format such as JSON.

There will always be a tradeoff, however, especially as the data we consume is not our own but other people’s. For instance, the beloved photo app in Facebook would never be possible with an edge-centric approach, as the data actually being consumed is from someone else. There is no way to store n^2 information with all your friends from an edge device; it must be centralized to an extent.

For some applications like gaming we have a high sensitivity to latency, as the interactions are very time-dependent, both for the actions necessary to play the game and for how we take input for those actions through visual cues in the game itself. But if we look at examples such as OnLive, which allows lightweight endpoints to be used in highly immersive first-person gaming, clearly there is a huge dependency on the network. This is also the prescriptive approach behind Silk, although Colony talks about this in his context of App-Internet. The reality is that the Silk browser is merely a renderer. All of the heavy lifting is done on the Amazon servers and delivered over a lightweight communications framework called SPDY.

Apple has clearly dominated, pushing all focus today onto mobile device development. The App-Internet model is nothing more than the realization that “Applications” must be in the context of the model, something which the prior “cloud” and “web” didn’t clearly articulate.

The Flash wars are over.. or are they?

So what is the point of all of this App-Internet anyway? Well, the adoption of HTML5, CSS3, JavaScript and advanced libraries, code generation, etc. has clearly unified web development and propelled the interface into a close-to-native environment. There are, however, some inconsistencies in the model which allow Apple to stay just one step ahead with the look and feel of native applications. The reality is we have already been in this App-Internet model for some time now, ever since the first XHR (XMLHttpRequest) was embedded in a page with access to a high-performance JavaScript engine like V8.

So don’t be fooled: without the network we would have no ability to distribute work and handle the massive amount of data being created and shared around the world. Locality is important until it’s not.. at least until someone builds a quantum computer network.

over and out…

  8. (Note: This is more representative of a trend than a wholly accurate assessment of pricing)

Cloud Networking Hype or Reality?

A colleague of mine pointed out a new post by Jayshree Ullal from Arista Networks on Cloud Networking Reflections. I can’t help but comment on a few things for my own sanity.

Prediction #1: The rise in dense virtualization is pushing the scale of cloud networking.

Evaluation #1: True

IT is very “trend” oriented, meaning that sometimes, given the complexity of operating a distributed system, people are too busy to look deep into the problem for themselves and instead lean on the communities of marketing wizards to make a decision for them. Despite VMWare’s success, hardware virtualization makes up a very small part of the worldwide server base, which is estimated at around 32M servers [1]. I predict a reversal of this trend, which peaked around 2008, within a few short years, for several reasons.

  • One is the realization that the “hardware virtualization” tax grows increasingly with I/O, a very significant problem as we move into the era of “Big Data”. The reality is that as we move to more interactive and socially driven applications, the OS container is not as crucial as it is in a generalized client/server model. Application developers need to continuously deal with higher degrees of scalability, application flexibility, improved reliability, and faster development cycles. Using techniques like Lean software development and Continuous Delivery, application developers can get a Minimal Viable Product out the door in weeks, sometimes days.
  • Two, the age of “Many Task Computing” is upon us and will eventually sweep away the brain-dead apps and the entire overhead that comes with supporting multiple thick containers. I say let’s get down with LXC or, better yet, Illumos Zones, which give us namespace isolation without the SYSCALL overhead.
  • Three, heterogeneous computing is crucial for interactive and engaging applications. Virtualization hides this at the wrong level; we need programming abstractions such as OpenCL/WebCL for dealing with specialization in vector programming and floating-point support via GPUs. Even micro-servers will have a role to play here, allowing a much finer grain of control while still improving power efficiency.

It’s not “dense virtualization” pushing the scale of cloud networking; it is the changing patterns in the way applications are built and used. This will unfortunately continue to change the landscape of both systems design and networks.

My Advice: Designers will finally wake up and stop being forced into this “hyper-virtualized” compute arbitrage soup and engineer application services to exploit heterogeneous computing instead of being constrained by a primitive and unnecessary abstraction layer. In the meantime, ask your developers to spend the time to build scalable platform services with proper interfaces to durable and volatile storage, memory and compute. In this way you isolate yourself from specific implementations, removing the burden of supporting these runaway applications.

Prediction #2: “Fabric” has become the marketing buzzword in switching architectures from vendors trying to distinguish themselves.

Evaluation #2: Half-True.

I think the point of having “specialized” fabrics is a side effect of the scalability limits of 1990s-based network design, protocols and interconnect strategies. Specialized and proprietary fabrics have been around for years; Thinking Machines, Cray, SGI and Alpha all needed to deal with scalability limits connecting memory and compute together. Today’s data centers are an extension of this and have become modern supercomputers connected together (i.e. a fabric).

Generally, the current constraints and capabilities of technology have forced a “rethink” on how to optimize network design for a different set of problems. There is nothing terribly shocking here unless you believe that current approaches are satisfactory. If the current architectures are satisfactory, why do we have so much confusion on whether to use L2 multi-pathing or L3 ECMP? Why is there not ONE methodology for scaling networks? Well, I’ll tell you if you haven’t figured it out. It’s because the current set of technologies ARE constrained and lack the capabilities necessary for truly building properly designed networks for future workloads.

The beauty of Arista’s approach is we can scale and manage two to three times better with standards. I fail to understand the need for vendor-specific proprietary tags for active multipathing when standards-based MLAG at Layer 2 or ECMP at Layer 3 (and future TRILL) resolves the challenges of scale in cloud networks. 

Scale 2x to 3x better with standards? How about 10x or better yet 50x? Really 2-3x improvement in anything is statistically insignificant and you are still left with corner cases, which absolutely grind your business to a halt. Pointing out MLAG is better than TRILL or SPB or ECMP is better than whatever is not the point. I mean really, how many tags do we need in a frame anyway and what the hell with VXLAN and NVGRE? Additional data-plane bits are not the answer, we need to rethink the layering model, address architecture and error and flow control mechanisms.

There is no solution unless you break down the problem, layer by layer, until you remove all of the elements down to just the invariants. It’s possible that this is the direction of OpenFlow/SDN; the only problem may be that it destroys the layers entirely, but maybe that’s the only way to build them back up the right way.

BTW. There is nothing really special about saying “standards”; after all, TCP/IP itself was a rogue entry in the standards work (INWG 96), so it’s another accidental architecture that happened to work.. for a time!

My Advice: For those who have complete and utter autonomy, treat the DC as a giant computer which should be designed to meet the goals of your business within the capabilities and constraints of today’s technology. Once you figure it out, you can use the same techniques in software to OpenSource your innovation, making it generally feasible for others to enter the market (if you care about supply chain). For those who don’t, ask your vendors and standards bodies why they can’t deliver a single architecture which doesn’t continuously violate the invariants by adding tags, encaps, bits, etc.

Prediction #4: Commercially available silicon offers significant power, performance and scale benefits over traditional ASIC designs.

Evaluation #4: Very true.

Yea, no surprise here, but it’s not as simple as just picking a chip off the shelf. When designing something as complex as an ASIC, you have to make certain tradeoffs. Feature sets build up over time, and it takes time to move back to a leaner model of primitive services with exceptional performance. There is no difference between an ASIC designer working for a fabless semiconductor company spinning out wafers from TSMC and a home-grown approach; it is in the details of the design and implementation, with all of the sacrifices one makes when choosing how to allocate resources.

My Advice: Don’t make decisions based on who makes the ASIC but on what can be leveraged to build a balanced and flexible system. The reality is there is more to uncover than just building ASICs; for instance, how about a simpler data plane model which would allow us to create cheaper and higher performance ASICs?

Prediction # 5: FCoE adoption has been slow and not lived up to its marketing hype.

Evaluation # 5: True.

“A key criterion for using 10GbE in storage applications is having switches with adequate packet buffering that can cope with speed mismatches to avoid packet loss and balance performance.”

This is also misleading, as it compares FCoE with FC with 10GE sales as a way of dismissing a viable technology. But the reality is that the workload pattern changed, moving the focus from interconnect to interface.

From an application development point of view, interfacing with storage at a LUN or “block” level is incredibly limited. It’s simply not the right level of abstraction, which is why we started to move to NAS, or “file” based approaches, and are even seeing the reemergence of content-based and distributed object stores.

Believe me, developers don’t give a care if there is an FC backend or FCoE. It is irrelevant; the issue is performance. When you have a SAN-based system you are dealing with a system balanced for different patterns of data access, reliability and coherency. This might be exactly what you don’t want: you may be very write intensive or read intensive and require a different set of properties than current SAN arrays provide.

Adding buffering to the equation not only makes things worse but also increases the cost of the network substantially. Firstly, the queues can build up very quickly, especially at higher clock speeds, and the impact on TCP flow control is a serious issue. I am sure the story is not over and we will see different ways of dealing with this problem in the future. You might want to look a little closer at FC protocols and see if you can see any familiarity with TRILL.

My Advice: Forget the hype of Hadoop and concentrate on isolating the workload patterns that impact your traffic matrix. Concentrate on what the expectations of the protocols are, and on how to handle error and flow control, mobility, isolation, security and addressing. Develop a fundamental understanding of how to impart fair scheduling in your system to deal with demand floods, partitioning events and chaotic events. Turns out a proper “load shedding” capability can go a long way in sustaining system integrity.

Yes I know, that’s a lot of opaque nonsense, and while many advantages exist for businesses which choose to utilize the classical models, there are still many problems in dealing with the accidental architecture of today’s networks. The future is not about what we know today, but what we can discover and learn from our mistakes once we realize we made them.
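Load shedding sounds abstract, so here is about the smallest sketch I can think of: a bounded queue that rejects new work once it is full instead of letting the backlog grow without limit. `SheddingQueue` is a hypothetical name, and tail drop is just one simple policy among several; real systems often shed by priority or by request age instead.

```javascript
// Hypothetical bounded queue that sheds load rather than growing without limit.
function SheddingQueue(capacity) {
    this.capacity = capacity
    this.items = []
    this.shed = 0   // count of rejected requests, useful for instrumentation
}

// Returns true if the request was accepted, false if it was shed.
SheddingQueue.prototype.offer = function(request) {
    if (this.items.length >= this.capacity) {
        this.shed += 1
        return false
    }
    this.items.push(request)
    return true
}

// Removes and returns the oldest accepted request, or undefined when empty.
SheddingQueue.prototype.take = function() {
    return this.items.shift()
}
```

The caller sees the rejection immediately and can back off or retry elsewhere, which keeps latency bounded for the work that was accepted during a demand flood.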

While I do work at Cisco Systems as a Technical Leader in the DC Group, these thoughts are my own and don’t necessarily represent those of my employer.