Category Archives: Enterprise Cloud

Cisco UCS “Cloud In A Box”: Terabyte Processing in Real Time

Now I hate using the term “Cloud” for anything these days, but in the latest blog entry from Shay Hassidim, Deputy CTO of GigaSpaces, “Terabyte Elastic Cache clusters on Cisco UCS and Amazon EC2,” a single Cisco UCS 260 took the place of 16 Amazon High-Memory Quadruple Extra Large instances. With 16:1 consolidation, imagine what you can do with a rack of these; in other words, forget about Hadoop, let's go real-time, data-grid-enabled networking!

With 40 Intel cores and 1TB of memory available to the GigaSpaces XAP high-performance in-memory data grid, the system achieved an astounding 500,000 ops/sec on 1024-byte POJOs and could load 1 billion objects in just under 36 minutes.
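As a quick sanity check on those numbers (a back-of-the-envelope calculation on my part, not a figure from the benchmark report): at a sustained 500,000 ops/sec, a billion single-object writes works out to

1,000,000,000 objects ÷ 500,000 ops/sec = 2,000 seconds ≈ 33 minutes

which lines up nicely with the reported “just under 36 minutes” once you allow for ramp-up and overhead.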

Now this might not sound extraordinary, but when you consider what it takes to build an application whose bottleneck on a 40-core, 1TB system is actually CPU and memory, while properly dealing with failures and providing automation and instrumentation, you can't beat this kind of system. GigaSpaces is also integrated with the Cisco UCS XML API for dynamic scaling of hardware resources.

Eventually people will catch on that memory is critical for dealing with “Big Data” and that it's no longer an issue of reliability or cost. Without disk rotational latency and poor random access in the way, we can push the limits of our compute assets while leveraging the network for scale. Eventually we might see a fusion of in-memory data grids with the network in a way that allows us to deal with permutation traffic patterns by changing the dynamics of networking, processing and storage.

Forrester Views Cloud/Web as Outmoded and App-Internet as the New Model

LeWeb 2011: George Colony, Forrester Research, “Three Social Thunderstorms”

Over the past several years the word “Cloud” has been used, and to some extent abused, almost to the point of meaninglessness. Every technology company, provider and enterprise is immersed in some sort of “cloud” project, although the exact descriptions of these projects may fall short of the formal NIST definitions. I think as technologists we tend to rebel against the status quo in an attempt not just to redefine the marketplace but also to claim a new path of our own as we iterate over the current challenges of delivering new applications and services.

Just as we have overused and bludgeoned the hell out of terms like internet, virtualization and web (the prior name for cloud), we are bound to move into a new set of vernacular definitions such as intercloud, interweb, fog computing or, in the case of Forrester CEO George Colony, App-Internet.

“Web and cloud are … outmoded,” concludes Mr. Colony, as he goes on to explain App-Internet as the next model, offering a “faster, simpler, more immersive and a better experience.”

The thesis for this conclusion is based on the figure above, where the y-axis is defined as “utilities per dollar” and the x-axis is time. P represents “Moore's Law” and speaks to the scalability of processing power. In reality, the beauty behind Moore's Law is lost in translation. What Moore actually said was that transistors on a chip would double every year; David House, an Intel executive at the time, subsequently noted that the changes would cause computer performance to double every 18 months [1].
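Put as a formula (my paraphrase of House's observation, not Colony's chart), the P curve is simple compound doubling:

performance(t) ≈ performance(0) × 2^(t / 1.5), with t in years and a doubling period of 18 months

The trouble, as the next paragraph argues, is that delivered performance does not actually track that tidy exponential.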

If you plot transistors per chip against actual computer performance, you might see a different picture due to the thermodynamic properties and manufacturing complexity of CMOS-based technology, not to mention the difficulty of actually utilizing that hardware with today's languages, application methodologies, libraries and compilers.

S is for the growth in storage, which Colony calls “Hitachi's Law” and which predicts that storage will double approximately every 12 months. This too is somewhat contrived: scaling magnetic media on disk is becoming extremely difficult as we approach the limits of perpendicular recording, although there may be some promise in the discovery of adding NaCl to the recording process [2]. Yes, we can build bigger houses with disks packed to the ceiling, but the logistics of managing such a facility are increasingly hitting upper limits (imagine shuffling through a facility of over 100,000 sq ft replacing all those failed hard drives).

N relates to the network, where Colony goes on to describe the adoption rates of 3G vs. 4G. First and foremost, nailing down exactly what 4G is and means is an exercise in itself, as most vendors are implementing various technologies under this umbrella [3]. With an estimated 655 million people adopting 4G in its various forms by 2010 [4] and the quick adoption of new mobile devices, I think this is a bit short-sighted.

But there is another missing aspect: all of the towers that collect those 3G and 4G signals need to be back-hauled into the Internet backbone. With 40GE/100GE ratified in the IEEE, I expect the first wave of 100GE deployments to be put into production in 2012 [5].

Colony goes on to say, “If your architecture was based on network you are wasting all of these improvements in processing and storage … the center (meaning the warehouse-scale datacenters such as Google, Amazon and Microsoft) is becoming more powerful and the periphery is becoming ever more powerful…”

His point is valid to an extent, but not because of the P, S, N curves; rather, it is because the devices are now so powerful AND we have such a robust network infrastructure that we can take advantage of all of this processing power and storage available to us. After all, if transport pricing had continued to rise as the late, great Jim Gray predicted in his paper on Distributed Computing Economics [7], we would not even be having this discussion, because without the ability to distribute data across the network, all we would have are some very smart, expensive devices that would essentially be fancy calculators.

To that point, Colony compares today's devices with their predecessors, but as stated earlier it's not a fair comparison: “In 1993 the iPad 2 would have been considered one of the 30 fastest computers in the world.” Unfortunately, the problem space has changed since 1993, and if we follow “Jevons Paradox,” the proposition that technological progress which increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource [6], it would be hard to compare the two accurately.

So the reality is that all of these iterations, from the early ARPANET view of access to expensive time-sharing computer centers to the highly distributed and interconnected services we have today, are just a succession of changes necessary to keep up with the demand for more information. Who knows what interesting changes will happen in the future, but time and time again we have seen amazing strides taken to build communities and share our lives through technology.

So let's take a closer look at the App-Internet model.

Hmm. So how is this different from today's “web-centric” application architecture? After all, isn't a web browser like Chrome or Safari an “application”?

Jim Gray defined the ideal mobile task as one that is stateless (no database or data access), has tiny network input and output, and has a huge computational demand [7]. To be clear, his assumption was that transport pricing would keep rising and make the economics infeasible, but as we know the opposite happened: transport pricing has fallen [8].

“Most web and data processing applications are network or state intensive and are not economically viable as mobile applications.” Again, his assumptions about telecom pricing made this prediction incorrect. He also contended that “data loading and data scanning are cpu-intensive; but they are also data intensive and therefore are not economically viable as mobile applications.” The root of his conjecture was that “the break-even point is 10,000 instructions per byte of network traffic or about a minute of computation per MB of network traffic.”

Clearly the economics and computing power have changed significantly in only a few short years. No wonder we see such paradigm shifts and restructuring of architectures and philosophies.

The fundamental characteristic that supports a “better experience” is latency; we perceive latency as the responsiveness of an application to our interactions. So is he talking about the ability to process more information on intelligent edge devices? Does he not realize that a good portion of applications written for the web are built with JavaScript, and that advances in virtual machine technology like Google's V8 are what enable all of those highly immersive, fast-responding interactions? Even data loading and data scanning have improved through advances in AJAX programming and the emerging WebSocket protocol, which allows full-duplex communication between the browser and the server in a common serialization format such as JSON.
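As a rough illustration of that full-duplex pattern (the post is talking about browser JavaScript, but the same idea can be sketched from any client; this uses the JDK 11+ java.net.http.WebSocket API, and the endpoint URL and JSON payload are invented for the example):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class FullDuplexSketch {
    public static void main(String[] args) throws Exception {
        // Open a WebSocket connection (hypothetical endpoint).
        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create("wss://example.com/feed"), new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket socket, CharSequence data, boolean last) {
                        // Server pushes arrive asynchronously while we are free to send our own frames.
                        System.out.println("server pushed: " + data);
                        socket.request(1); // ask for the next frame
                        return null;
                    }
                })
                .join();

        // Send a JSON message on the same connection; no request/response pairing required.
        ws.sendText("{\"type\":\"subscribe\",\"channel\":\"quotes\"}", true).join();

        Thread.sleep(5_000); // keep the demo alive long enough to receive pushes
        ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```

The point is simply that the server can push at any time while the client sends its own messages, with JSON as the shared serialization format.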

There will always be a tradeoff, however, especially as the data we consume is not our own but other people's. For instance, the beloved photo app in Facebook would never be possible using an edge-centric approach, as the data actually being consumed comes from someone else. There is no way to store n^2 worth of relationship information with all your friends on an edge device; it must be centralized to an extent.

For some applications, like gaming, we have a high sensitivity to latency, as the interactions are very time-dependent, both in the actions necessary to play the game and in how we take input for those actions through visual cues in the game itself. But if we look at examples such as OnLive, which allows lightweight endpoints to be used for highly immersive first-person gaming, clearly there is a huge dependency on the network. This is also the prescriptive approach behind Silk, although Colony talks about it in his App-Internet context. The reality is that the Silk browser is merely a renderer; all of the heavy lifting is done on Amazon's servers and delivered over a lightweight communications protocol called SPDY.

Apple has clearly dominated, pushing all of today's focus onto mobile device development. The App-Internet model is nothing more than the realization that “applications” must be part of the model itself, something the prior “cloud” and “web” terms didn't clearly articulate.


The Flash wars are over… or are they?


So what is the point of all of this App-Internet anyway? Well, the adoption of HTML5, CSS3, JavaScript and advanced libraries, code generators, etc. has clearly unified web development and propelled the interface into a close-to-native experience. There are, however, some inconsistencies in the model which allow Apple to stay just one step ahead with the look and feel of native applications. The reality is we have already been in this App-Internet model for some time now, ever since the first XHR (XMLHttpRequest) was embedded in a page with access to a high-performance JavaScript engine like V8.

So don't be fooled: without the network we would have no ability to distribute work and handle the massive amount of data being created and shared around the world. Locality is important until it's not… or at least until someone builds a quantum computer network.

over and out…

  1. http://news.cnet.com/2100-1001-984051.html
  2. http://www.techspot.com/news/45887-researchers-using-salt-to-increase-hard-drive-capacity.html
  3. http://en.wikipedia.org/wiki/4g
  4. http://www.fiercewireless.com/story/real-world-comparing-3g-4g-speeds/2010-05-25
  5. http://www.businesswire.com/news/home/20110923005103/en/Xelerated-Begins-Volume-Production-100G-Network-Processor
  6. http://en.wikipedia.org/wiki/Jevons_paradox
  7. http://research.microsoft.com/apps/pubs/default.aspx?id=70001
  8. http://drpeering.net/white-papers/Internet-Transit-Pricing-Historical-And-Projected.php (Note: This is more representative as a trend rather than wholly accurate assessment of pricing)

Cloud Networking: Hype or Reality?

A colleague of mine pointed out a new post by Jayshree Ullal of Arista Networks, Cloud Networking Reflections. I can't help but comment on a few things for my own sanity.

Prediction #1: The rise in dense virtualization is pushing the scale of cloud networking.

Evaluation #1: True

IT is very “trend” oriented, meaning that, faced with the complexity of operating distributed systems, people are often too busy to look deeply into the problem for themselves and instead lean on communities of marketing wizards to make the decision for them. Despite VMware's success, hardware virtualization makes up a very small part of the worldwide server base, which is estimated at around 32M servers [1]. I predict that within a few short years we will see a reversal of this trend, which peaked around 2008, for several reasons:

  • One is the realization that the “hardware virtualization” tax grows with I/O, a very significant problem as we move into the era of “Big Data.” The reality is that as we move to more interactive, socially driven applications, the OS container is not as crucial as it is in a generalized client/server model. Application developers need to deal continuously with higher degrees of scalability, application flexibility, improved reliability and faster development cycles. Using techniques like Lean software development and Continuous Delivery, they can get a Minimum Viable Product out the door in weeks, sometimes days.
  • Two, the age of “Many Task Computing” is upon us and will eventually sweep away the brain-dead apps and the overhead that comes with supporting multiple thick containers. I say let's get down with LXC, or better yet Illumos Zones, which give us namespace isolation without the syscall overhead.
  • Three, heterogeneous computing is crucial for interactive and engaging applications. Virtualization hides this at the wrong level; we need programming abstractions such as OpenCL/WebCL for dealing with specialization in vector programming and floating-point support via GPUs. Even micro-servers will have a role to play here, allowing a much finer grain of control while still improving power efficiency.

It's not “dense virtualization” that is pushing the scale of cloud networking; it is the changing patterns in the way applications are built and used. This will unfortunately continue to change the landscape of both systems design and network design.

My Advice: Designers will finally wake up, stop being forced into this “hyper-virtualized” compute arbitrage soup, and engineer application services to exploit heterogeneous computing instead of being constrained by a primitive and unnecessary abstraction layer. In the meantime, ask your developers to spend the time to build scalable platform services with proper interfaces to durable and volatile storage, memory and compute. That way you isolate yourself from specific implementations, removing the burden of supporting these runaway applications.

Prediction #2: “Fabric” has become the marketing buzzword in switching architectures from vendors trying to distinguish themselves.

Evaluation #2: Half-True.

I think the appeal of “specialized” fabrics is a side effect of the scalability limits of 1990s-era network designs, protocols and interconnect strategies. Specialized and proprietary fabrics have been around for years: Thinking Machines, Cray, SGI and Alpha all needed to deal with the scalability limits of connecting memory and compute together. Today's data centers are an extension of this and have become modern supercomputers connected together (i.e., a fabric).

Generally, the constraints and capabilities of today's technology have forced a “rethink” of how to optimize network design for a different set of problems. There is nothing terribly shocking here unless you believe that current approaches are satisfactory. If the current architectures were satisfactory, why would there be so much confusion over whether to use L2 multipathing or L3 ECMP? Why is there not ONE methodology for scaling networks? Well, I'll tell you if you haven't figured it out: it's because the current set of technologies ARE constrained and lack the capabilities necessary for properly designing networks for future workloads.
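For readers who have not waded into that debate, the L3 ECMP half of it boils down to hashing each flow's headers and picking one of several equal-cost next hops. Here is a minimal sketch (the field names, hash choice and next-hop labels are mine, not any vendor's implementation):

```java
import java.util.List;
import java.util.Objects;

public class EcmpSketch {
    // The classic 5-tuple that most ECMP implementations hash on.
    record Flow(String srcIp, String dstIp, int srcPort, int dstPort, int protocol) {}

    // Pick one of N equal-cost next hops; the same flow always maps to the same path,
    // which keeps a TCP flow's packets in order but can leave the load spread uneven.
    static String selectNextHop(Flow flow, List<String> nextHops) {
        int hash = Objects.hash(flow.srcIp(), flow.dstIp(), flow.srcPort(), flow.dstPort(), flow.protocol());
        return nextHops.get(Math.floorMod(hash, nextHops.size()));
    }

    public static void main(String[] args) {
        List<String> paths = List.of("spine-1", "spine-2", "spine-3", "spine-4");
        Flow f = new Flow("10.0.0.5", "10.0.1.9", 49152, 443, 6);
        System.out.println("flow pinned to " + selectNextHop(f, paths));
    }
}
```

Because the mapping is per-flow, a few large flows can still pile onto the same path, which is one reason the argument over the “one true” scaling methodology never quite ends.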

“The beauty of Arista’s approach is we can scale and manage two to three times better with standards. I fail to understand the need for vendor-specific proprietary tags for active multipathing when standards-based MLAG at Layer 2 or ECMP at Layer 3 (and future TRILL) resolves the challenges of scale in cloud networks.”

Scale 2x to 3x better with standards? How about 10x, or better yet 50x? Really, a 2-3x improvement in anything is hardly significant, and you are still left with corner cases that absolutely grind your business to a halt. Pointing out that MLAG is better than TRILL or SPB, or that ECMP is better than whatever, is not the point. I mean really, how many tags do we need in a frame anyway, and what the hell is with VXLAN and NVGRE? Additional data-plane bits are not the answer; we need to rethink the layering model, the addressing architecture, and the error and flow control mechanisms.

There is no solution unless you break down the problem, layer by layer, until you strip away everything but the invariants. It's possible that is the direction of OpenFlow/SDN; the only problem is that it may destroy the layers entirely, but maybe that's the only way to build them back up the right way.

BTW, there is nothing really special about saying “standards”; after all, TCP/IP itself was a rogue entry in the standards work (INWG 96), so it is another accidental architecture that happened to work… for a time!

My Advice: For those who have complete and utter autonomy, treat the DC as a giant computer that should be designed to meet the goals of your business within the capabilities and constraints of today's technology. Once you figure it out, you can use the same techniques in software and open-source your innovation, making it feasible for others to enter the market (if you care about supply chain). For those who don't, ask your vendors and standards bodies why they can't deliver a single architecture that doesn't continuously violate the invariants by adding tags, encaps, bits, etc.

Prediction #4: Commercially available silicon offers significant power, performance and scale benefits over traditional ASIC designs.

Evaluation #4: Very true.

Yeah, no surprise here, but it's not as simple as just picking a chip off the shelf. When designing something as complex as an ASIC, you have to make certain tradeoffs. Feature sets build up over time, and it takes time to move back to a leaner model of primitive services with exceptional performance. There is no fundamental difference between an ASIC designer working for a fabless semiconductor company spinning wafers out of TSMC and a home-grown approach; the difference is in the details of the design and implementation, with all the sacrifices one makes when choosing how to allocate resources.

My Advice: Don't make decisions based on who makes the ASIC but on what can be leveraged to build a balanced and flexible system. The reality is there is more to uncover than just building ASICs; for instance, how about a simpler data-plane model that would let us create cheaper and higher-performance ASICs?

Prediction # 5: FCoE adoption has been slow and not lived up to its marketing hype.

Evaluation # 5: True.

“A key criterion for using 10GbE in storage applications is having switches with adequate packet buffering that can cope with speed mismatches to avoid packet loss and balance performance.”

This is also misleading, as it compares FCoE and FC against 10GE sales as a way of dismissing a viable technology. But the reality is that the workload pattern changed, moving the focus from the interconnect to the interface.

From an application development point of view, interfacing with storage at a LUN or “block” level is incredibly limiting. It's simply not the right level of abstraction, which is why we started to move to NAS, or “file”-based, approaches, and are even seeing a reemergence of content-based and distributed object stores.

Believe me, developers don't care whether there is an FC or FCoE backend; it is irrelevant, the issue is performance. When you have a SAN-based system, you are dealing with a system balanced for particular patterns of data access, reliability and coherency. This might be exactly what you don't want: you may be very write-intensive or read-intensive and require a different set of properties than current SAN arrays provide.

Adding buffering to the equation not only makes things worse, it also increases the cost of the network substantially. Queues can build up very quickly, especially at higher link speeds, and the impact on TCP flow control is a serious issue. I am sure the story is not over and we will see different ways of dealing with this problem in the future. You might want to look a little closer at the FC protocols and see whether you notice any familiarity with TRILL.
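As a rough back-of-the-envelope illustration (my numbers, not any particular switch spec): a standing queue adds delay equal to its occupancy divided by the drain rate, so

12 MB of buffer × 8 bits/byte ÷ 10 Gb/s ≈ 9.6 ms of added latency on a single hop

which is orders of magnitude above a typical intra-datacenter RTT, and exactly the kind of standing delay that loss-based TCP congestion control handles poorly.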

My Advice: Forget the hype of Hadoop and concentrate on isolating the workload patterns that impact your traffic matrix. Focus on what the expectations of the protocols are and on how to handle error and flow control, mobility, isolation, security and addressing. Develop a fundamental understanding of how to impose fair scheduling in your system to deal with demand floods, partitioning events and chaotic events. It turns out a proper “load shedding” capability can go a long way toward sustaining system integrity.
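To make “load shedding” concrete, here is a minimal sketch (thread counts, queue depth and names are my own assumptions, not a reference design): admission control in front of a worker pool that rejects new work the moment the bounded queue fills, rather than letting it pile up and drag the whole system down.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoadSheddingSketch {
    public static void main(String[] args) {
        // Bounded queue: at most 1,000 requests waiting; beyond that we shed load.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                16, 16, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of queueing forever

        for (int i = 0; i < 10_000; i++) {
            final int requestId = i;
            try {
                pool.execute(() -> handle(requestId));
            } catch (RejectedExecutionException overload) {
                // Shed the request: fail fast (e.g. return an HTTP 503) so the rest of the
                // system keeps its latency and integrity during a demand flood.
                System.out.println("shed request " + requestId);
            }
        }
        pool.shutdown();
    }

    static void handle(int requestId) {
        try {
            Thread.sleep(10); // stand-in for real work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Rejected requests can be retried or answered with a fast failure; either is better than unbounded queuing during a demand flood.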

Yes I know, that's a lot of opaque nonsense, and while many advantages exist for businesses that choose to utilize the classical models, there are still many problems in dealing with the accidental architecture of today's networks. The future is not about what we know today, but about what we can discover and learn from our mistakes once we realize we have made them.

While I do work at Cisco Systems as a Technical Leader in the DC Group, these thoughts are my own and don’t necessarily represent those of my employer.

[1] http://www.mediafire.com/file/zzqna34282frr2f/koomeydatacenterelectuse2011finalversion.pdf

 

Distributed Computing

We can all agree that we are in the midst of a shift in the practice of information technology delivery, fueled by economization, global interconnection and changes in both computer and social sciences.

Although this can be considered revolutionary change in some circumstances, it is rooted in problems known almost 20 years ago. For those of you interested in the history, and in a very prescient look at this current shift, read “A Note on Distributed Computing” [1]. The paper concentrates on attempts to integrate distributed computing into the prevailing language model and on the issues of latency, concurrency and partial failure in distributed systems.

“They [programmers] look at the programming interfaces and decide that the problem is that the programming model is not close enough to whatever programming model is currently in vogue… A furious bout of language and protocol design takes place and a new distributed computing paradigm is announced that is compliant with the latest programming model. After several years, the percentage of distributed applications is discovered not to have increased significantly, and the cycle begins anew.”

This paper concludes with very specific advice:

“Differences in latency, memory access, partial failure, and concurrency make merging of the computational models of local and distributed computing both unwise to attempt and unable to succeed.”

Now, there are a few things not known back in 1994, including where exactly Moore's Law would take us, how languages would develop, how ubiquitous device access would become and the scale to which the Internet has grown; but when you examine the issues discovered by the likes of Google, Amazon, Facebook, etc., you recognize that the cycle has indeed begun anew.

The interesting part is that the velocity of innovation to solve these problems, along with the cooperative nature of open source software, has fueled an even broader manifestation of change. Companies of all sizes can contribute to the greater good of open software, enabling communities of interest to develop and share information in an open yet secure way.

[1] Waldo, Wyant, Wollrath, Kendall, “A Note on Distributed Computing,” Sun Microsystems Laboratories, 1994, SMLI TR-94-29

Amazon on the Enterprise

Last week I attended the AWS Cloud for the Enterprise event held in NY and was not surprised to see a massive turnout for Dr. Vogels' keynote and the customer presentations that followed.

I have been to the last three events held in NY and each time the crowds get larger, the participants become more enthusiastic and the level of innovation continues to accelerate at a mad pace.

Some of the customer representatives included Condé Nast Digital (including Wired.com), Nasdaq OMX, Sony Music Entertainment and New York Life. I thought the most compelling discussion was given by Michael Gordon, First VP at New York Life, who discussed how he migrated his applications onto AWS in two weeks while translating capital expenditures into Amazon's pay-as-you-go model.

At a time when it costs an enterprise upwards of $1MM just to engage the IT organization, with 3-6 month or longer implementation timelines, Amazon is showing how web-scale approaches can work for any kind of business, even a hundred-year-old life insurance company.

Some of the other salient parts of the discussion covered how application development teams had to think differently about hosting their applications on EC2. Customers have to make some changes to their application architecture when running on virtualized hosts. Some of Amazon's technologies, like EBS and SimpleDB, can be used to provide persistence and an effective caching layer for key/value pairs.
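As a sketch of what such a key/value caching layer looks like from the application's side (the KeyValueStore interface and all names here are hypothetical placeholders, not the AWS SDK), the common cache-aside pattern is roughly:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class CacheAsideSketch {
    /** Hypothetical durable key/value backend (imagine SimpleDB or a database on EBS behind it). */
    interface KeyValueStore {
        Optional<String> fetch(String key);
        void save(String key, String value);
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final KeyValueStore store;

    CacheAsideSketch(KeyValueStore store) {
        this.store = store;
    }

    /** Read path: serve from memory if possible, otherwise load from the durable store and cache it. */
    String get(String key) {
        return cache.computeIfAbsent(key, k -> store.fetch(k).orElse(null));
    }

    /** Write path: write through to the durable store, then refresh the cache. */
    void put(String key, String value) {
        store.save(key, value);
        cache.put(key, value);
    }
}
```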

As Enterprises start to incorporate similar IaaS services within their own organizations app/dev teams will surely wind up reevaluating their architecture and finding alternative ways for accomplishing performance and scalability goals.. This is definitely a key point to understand and why its important for developers and infrastructure architects to take a blank slate approach in re architecting our business applications.

Emergence of Data Grids to Solve Scaling Problems

There is a great post at BigDataMatters discussing the emergence of Open Source Data Grids and the introduction of Infinispan 4.0.0 Beta 1.

The Infinispan site defines data grids as:

Data grids are highly concurrent distributed data structures. They typically allow you to address a large amount of memory and store data in a way that it is quick to access. They also tend to feature low latency retrieval, and maintain adequate copies across a network to provide resilience to server failure.
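For a feel of what that definition means in code, here is a minimal embedded sketch against Infinispan's basic cache API (configuration is omitted; without a clustered configuration this runs as a plain local cache):

```java
import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class DataGridSketch {
    public static void main(String[] args) {
        // With no configuration this is a local cache; a clustered configuration
        // turns the same API into a distributed, replicated data grid.
        EmbeddedCacheManager manager = new DefaultCacheManager();
        Cache<String, String> grid = manager.getCache();

        grid.put("user:42", "{\"name\":\"Ada\"}");   // store a value in memory
        System.out.println(grid.get("user:42"));     // low-latency retrieval

        manager.stop();
    }
}
```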

In the article Chris Wilk explains some of the challenges in data grid technologies around dynamic routing.

The reason that GigaSpaces suffers from this limitation is that it has a fixed space routing table at deployment time. The above scenario was described to Manik who said that Infinispan does not suffer from this restriction as it uses dynamic routing tables. Infinispan allows you to add any number of machines without incurring any down-time.

The spreading of data across many hosts is accomplished using different techniques, but the point to take away here is that altering the partition routing logic in mid-stream is very disruptive to supporting distributed transactions. There are also many system-level aspects that create inconsistencies, including garbage collection and network overhead, which could jeopardize the movement of dynamic objects between partitions.

Increasing the capacity of a data grid while providing deterministic performance, robustness and consistency should be done by running a fixed number of partitions and “moving” partitions from one JVM to another, newly started JVM. With GigaSpaces you can have 10, 50 or 200 partitions when starting the data grid and have these running within a small number of JVMs; later you can increase the number of JVMs when needed (manually or dynamically). You can rebalance the system and spread the partitions across all the existing JVMs. It is up to you to determine how far you want to scale the system, which means you have total control over system behavior.

The routing mechanism in GigaSpaces will function without any problems and spread data across all partitions as long as you have more unique routing keys than partitions. This should not be a problem in 99.99% of cases.
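To make the fixed-partition routing idea concrete, here is a simplified sketch of my own (not GigaSpaces code): the routing key always hashes into a fixed number of partitions, and scaling out only moves whole partitions onto newly started JVMs; the routing function itself never changes.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionRoutingSketch {
    static final int PARTITIONS = 16;                 // fixed at deployment time
    // Which container (JVM) currently hosts each partition; rebalancing edits this map only.
    static final Map<Integer, String> placement = new HashMap<>();

    static int partitionFor(Object routingKey) {
        // Stable routing: the same key always lands in the same partition,
        // so in-flight transactions never see their data change partitions mid-stream.
        return Math.floorMod(routingKey.hashCode(), PARTITIONS);
    }

    public static void main(String[] args) {
        // Start small: all 16 partitions packed into two JVMs.
        for (int p = 0; p < PARTITIONS; p++) placement.put(p, p < 8 ? "jvm-1" : "jvm-2");

        // Scale out: relocate some of jvm-2's partitions to a new JVM. The routing
        // function is untouched; only the partition-to-JVM placement changes.
        for (int p = 12; p < PARTITIONS; p++) placement.put(p, "jvm-3");

        String key = "order-1001";
        int partition = partitionFor(key);
        System.out.println(key + " -> partition " + partition + " on " + placement.get(partition));
    }
}
```

If you had fewer unique routing keys than partitions, some partitions would simply never receive data, which is the caveat noted above.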

The comparison also ignores many other GigaSpaces features, such as Mule integration, event handling and data-processing high-level building blocks, web container and dynamic HTTP configuration, service management, system management tools, performance (especially for single-object operations, batch operations and local cache), text search integration, broad client support, large data support (up to several terabytes), large object support, Map-Reduce API, scripting language support (Java, .NET, C, Scala, Groovy…), cloud API support, schema evolution, etc.

Having new players is great and confirms that there is room for new vendors in this huge market for in-memory data grid technologies on the cloud (private/public), but it is important to make the right comparison.

See more here:
http://www.gigaspaces.com/wiki/display/SBP/Capacity+Planning
http://www.gigaspaces.com/wiki/display/CCF/CCF4XAP+Documentation+Home

The Reincarnation of Enterprise Architecture Cast as a “Cloud”

What have we learned from ITIL, Zachman and TOGAF? Now, I don't claim to be an expert in any of these models; although I have researched them in the past and am fond of their intent, I could never really figure out how to “operationalize” them while running a major production system. I have been through Six Sigma training from my days at BofA and found the mathematical principles, along with the DMAIC workflow, extremely important for IT to measure its full understanding of the business.

So today, what is Information Technology? Is the IT group organized, trained and supported to fully execute on the mission at hand? Can it really transform into a service-driven organization that can effectively manage cost, capacity and business flexibility?

So I am going to highlight the constructs of what I call the “Enterprise Stack.” Utilizing the combined intelligence and research around large-scale compute designs such as grids, clusters, farms and clouds, we can set up the organization to adapt correctly to the different demands of the business.

Enterprise Stack

There are three axes: Organizational Alignment, Archetypical Interfaces and Service Layers.

Organizational Alignment is about shifting the organization to focus on delivering a specific set of services which allow transparent access to workloads and resources.

Archetypical Interfaces deals with the architecture as a whole, providing different technology approaches to support the overarching application stack.

Service Layers divide up the technology boundaries based on Organizational Alignment. It is core to the delivery model of the SBU, PBU and IBU to cleanly segregate responsibilities across the service layers.

The thing to be aware of is that there is an implied circular reference here. An IBU could itself be a consumer of PaaS and SaaS services to support its business. Each BU is both a provider to and a consumer of the other organizations.

Enterprise Consumer

  • Can be the business itself or an external customer
  • Service Catalog Driven
  • Service Driven Management and Pricing
  • Qualitative and Quantitative Service Level Attributes

Below is an example set of responsibilities for each BU.

SPI

Archetypical Services are based on different technology capabilities.

Archetypes

It is definitely a new age, and despite the naysayers, IT is in the process of transforming into a much more efficient organization that can be driven at a higher velocity of change thanks to more effective computing models.