A View from the Developer Nation


By: Gary Berger

Every year the folks at Trifork and InfoQ put on a conference in San Francisco (expanding to other geographies) for enterprise software developers. The conference is geared towards team leads, architects, product managers and developers of all skill sets. This conference is different from most because the constituents are all practitioners (i.e. software developers), not academics, and the agenda spans from architecture to agile to cloud. Many of the leading players in the industry attend, including Twitter, NetFlix, Amazon and Facebook. The discussions are usually very intriguing and provide a glimpse into the complex ecosystems of these successful companies.

After being involved in this community for several years I can say that the level of maturity continues to accelerate as developers gain experience with new toolkits and new languages and are encouraged to fail fast through techniques like Lean Development and Kanban. Open Source continues to cross-pollinate software development, allowing developers to iterate through solutions, find what works and keep producing minimum viable products on schedule. The fast innovation in this industry is the culmination of many geniuses over time paying forward their hard work and dedication so others can benefit, and you can see their influences starting to emerge across architectures.

This paper does not intend to be a comprehensive text on software development practices, tools and techniques. I will, however, try to distill common themes and highlight specifics that I find interesting and that others might want to dig into further on their own.

JAVA as a Platform

It’s been several years now since I began advocating the use of JAVA as the abstraction layer for modern applications. I have been exposed to JAVA for many years, including designing the infrastructure for one of the world’s first equity matching kernels built in Java. At the time we leveraged Solid Data (DRAM-based) storage and one of the first object databases to store our transactions. We leveraged CORBA for inter-process messaging and the Netscape HTTP server for our front-end. It was a complicated set of interacting components that needed to work together flawlessly to be successful. The system itself was completely over-engineered, a standard design philosophy in those days, and performed perfectly for several years before the business could not be sustained any longer. It turned out the “Mom and Pop” investor wasn’t interested in trading after-hours, and there just wasn’t enough liquidity to sustain a marketplace.

Needless to say, JAVA is everywhere and has recently become the poster child for the nascent Platform as a Service model. Even companies like Yammer are falling back to JAVA from Scala because of inconsistent build tools and developer education. To be fair, some of that is the complexity that comes with polyglot programming, so take it or leave it. A great view on the practicality of JAVA from Erik Onnen is here: http://bit.ly/t2OJe2. Don’t forget to check out Oracle’s Public Cloud offering here: http://bit.ly/ubHvN0

Why JAVA? Why Now?

About a year ago now, Rod Johnson proclaimed, “Spring is the new cloud operating system”.  Even though the term “cloud” has been overloaded to the extent it’s almost superfluous, it was a profound statement when centered on the JVM and the JAVA language. For more information see Alex Buckley’s presentation on JAVA futures here: http://bit.ly/rQjlsX.

Sidebar: Rod’s Key Note

Rod Johnson is the General Manager of SpringSource, a company acquired by VMWare to expand its footprint into the core of application development. Rod usually speaks about the evolution of the Spring Framework, gadgets of his such as Spring Roo, and aspect-oriented programming, but this time he was somewhat reflective on his career and his life. His industry-acclaimed book on Java development practices turned into a massively successful Open Source project called the Spring Framework, from which the first real threat to the JEE ecosystem was felt. Spring and its philosophy of dependency injection are now mainstream, and his teams have a catalog of products under the SpringSource banner. Rod has had an astonishing career and talks enthusiastically about the potential for entrepreneurs to change the world through software. See Rod’s presentation here: http://bit.ly/rCh7vz

Oracle announced its acquisition of Sun Microsystems in 2009, and the Java community was rocked. Hints of legal battles with Google over Java left people assessing the impact on the Java ecosystem. Even the survival of the OpenJDK project, the open source incarnation of the JAVA platform, was put in question.

Luckily, Oracle has caught on that JAVA is the way forward and is now marketing JEE as a “Platform as a Service”. Clearly it is the breadth and depth of the JAVA platform that makes it successful: with a large set of toolkits, frameworks and libraries and a highly educated developer community, the JAVA platform is everywhere. From Google to NetFlix and Twitter, JAVA can be found running some of the world’s most successful systems.

JAVA is now Oracle’s future. From Fusion to Exalogic to the JEE platform itself taking on a full service model, JAVA continues to have a profound impact on the industry.

Sidebar: The “Application Memory Wall” and its impact on JAVA

When trying to understand the scalability limits of a system we sometimes have to look under the covers at the intent of the application and how the design leverages the core resources of a system. Gil Tene from Azul Systems boils down the problem this way: we all know that the memory wall exists and is determined by memory bandwidth and latency, which evolve at a much slower pace than CPUs. However, we have yet to breach this wall thanks to the phenomenal engineering of modern chips such as Intel’s Nehalem; instead we have been constrained by how applications are designed. This actually creates several issues, as infrastructure designers may “architect” the right solution to the wrong problem.

In the Java world the most restrictive function is garbage collection, the process of removing dead objects, recouping memory and compacting the fragmented heap. This process impacts all major applications, and teams spend man-years tweaking the available parameters on their collectors to push the problem as far into the future as they can, but alas it is unavoidable: the dreaded stop-the-world pause. Developers must be very conscious of their code and how they organize their object models in order to limit the overhead that every object carries on the heap. (See Attila Szegedi from Twitter’s presentation: http://bit.ly/v8OWmP).

Azul Systems has some of the brightest minds in systems design, memory management and garbage collection and has built a business to solve this specific problem. Currently the most widely used garbage collector (GC) is the Concurrent Mark Sweep (CMS) collector in the HotSpot JVM. CMS is classified as a monolithic stop-the-world collector, which must take a full lock to catch up with mutations (i.e. changes made during GC) and to compact the heap.

The C4 collector from Zing has recently been upgraded to version 5, which no longer requires a hypervisor for its page-flipping techniques; instead the necessary Linux kernel changes have been boiled down to a loadable kernel module (which Azul is trying to push into the upstream kernel through the Managed Runtime Initiative). This frees up wasted memory and removes the overhead of a hypervisor while still providing pause-less GC for applications. Zing has many more benefits, which can be investigated offline. See Gil’s presentation here: http://bit.ly/vN5xQ0.

The JAVA ecosystem is thriving; both the runtime and the language are becoming better at dealing with concurrency, expressiveness and performance. Despite the fact that the current incarnation of objects has not lived up to its promise (see below), developers seem to always find a way around that, continuing the evolution of this platform.


For several years now the developer community has been entrenched in debate over Object Oriented Programming (OOP) vs. Functional Programming, but more specifically over the implementation of OOP methodology in Object Oriented Languages (OOLs) such as JAVA. In part, code reuse, componentization and frameworks have not lived up to their promises. Readability has for the most part been sacrificed, as encapsulation and inheritance have been somewhat abused thanks to the creativity of developers and high-pressure timelines. QCon’s “Objects on Trial” was an interesting debate over the use of objects; sorry to say, the jury voted “guilty”. Regardless of the verdict, poly-paradigm computing (i.e. a mix of object oriented, procedural, functional, etc.) will play a big role in application development, for the same reason that different languages are better for different purposes. This means that app developers need to work transparently across these styles, picking up on the strengths, such as reusable software patterns, while minimizing the weaknesses, such as poor concurrency support.
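
As a small illustration of mixing paradigms in one codebase, here is a hypothetical sketch in JavaScript (the class, field names and discount rule are invented for the example): an object-oriented class sits alongside a functional collection pipeline, each used where it reads best.

```javascript
// An OO class models the domain entity...
class Order {
  constructor(id, amount) {
    this.id = id;
    this.amount = amount;
  }
  withDiscount(pct) {
    // Functional style inside an OO method: return a new object, don't mutate.
    return new Order(this.id, this.amount * (1 - pct));
  }
}

const orders = [new Order(1, 100), new Order(2, 250), new Order(3, 40)];

// ...while a functional pipeline replaces an explicit procedural loop.
const discountedTotal = orders
  .filter(o => o.amount >= 50)           // keep large orders
  .map(o => o.withDiscount(0.1))         // OO method call inside the pipeline
  .reduce((sum, o) => sum + o.amount, 0);
```

Neither style "wins"; the class keeps the domain model reusable while the pipeline keeps the aggregation readable and free of shared mutable state.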

Sidebar: Ruby, Groovy, Scala, Erlang, Javascript, DART

Anyone who says language doesn’t matter when designing a system either doesn’t have a system that cares about flexibility, performance and agility or simply doesn’t have a large enough problem to care. All things being equal, developers will gravitate to what they are most familiar with regardless of readability, agility or the complexity it adds. Language development continues to evolve, from simple annotations replacing lines of boilerplate code, to aspect-oriented programming, to syntactic sugar that makes code much more readable. Languages are the face of the development process: the more expressive and easier to use they are, the faster applications can be created and the easier they are to maintain and extend.

Over the past several years new languages such as Ruby, Groovy and Scala have taken hold, older languages have been given a second life with the growing adoption of Erlang and JavaScript, and experimental languages such as Go and DART have popped up to challenge the others.

The need for “finding the right tool for the job” gives rise to the idea of polyglot programming, where many different languages are utilized based on their fit for purpose. You will find JAVA playing a critical role in almost every shop, with a sprinkle of JavaScript for client-side coding, perhaps some Erlang for middleware, plus Python, Groovy, C, C++, PERL, etc. Typically you have a high-performance core written in a statically compiled language like C++ responsible for all of the performance-critical code, a plugin or module architecture to extend the core, and a scripting or higher-level language which provides familiar storage primitives, collections, etc.

Frameworks such as Ruby on Rails have been a blessing to those who could leverage their rich dynamic language and templating tools for building quick web-based applications, but a curse when trying to do something at a level of scale beyond the Ruby runtime. (Twitter got burned by this and moved to Scala.)

For more complex scenarios such as running distributed gossip protocols with 99.999999% reliability, Erlang might be the right choice, at the sacrifice of more complex coding. (See Steve Vinoski’s talk on Webmachine, an HTTP server written in Erlang.)

There were a number of conversations throughout the conference regarding the importance of a type system. “A type system associates a type with each computed value”[1]. Basically, a type system ensures that the value provided to or returned from a function is of the expected type. This has several benefits: it allows readers of the code to see explicitly what variable types are being passed around (if the references aren’t obscured); by providing a static type it lets the compiler allocate properly sized memory; and it lets the compiler check the validity of a variable before it turns into a runtime error.
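
The last point is easiest to see from the other direction: in a dynamically typed language like JavaScript, a wrong argument type is accepted silently and only surfaces at runtime, often far from the call site. A small illustrative sketch (function names invented):

```javascript
function area(width, height) {
  return width * height;
}

area(3, "4");     // "works": JavaScript coerces the string, returns 12
area(3, "four");  // silently returns NaN -- the bug surfaces somewhere downstream

// A runtime guard is the dynamic-language substitute for what a static
// type checker would have rejected at compile time.
function checkedArea(width, height) {
  if (typeof width !== "number" || typeof height !== "number") {
    throw new TypeError("area expects two numbers");
  }
  return width * height;
}
```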

Unfortunately this isn’t a panacea; the type system may get in the way, especially when the primitive types are hidden, making it very difficult to read the code. More on this when we talk about DART.

One of the most popular languages in software development today is JavaScript. JavaScript has gotten a bad rap, and there are many things to be wary of when it comes to writing good JavaScript. Readers should consult Douglas Crockford’s books on JavaScript for more information: http://amzn.to/176RA4.

This tiny little language is a pure object oriented language (everything is an object). It provides anonymous functions, dynamic typing and functional reactive programming, and has a host of examples from which developers can learn. Oh, and it’s embedded in every browser running on almost every device you can imagine. Originally JavaScript was designed as a programmatic interface for manipulating the DOM in web browsers, but it has found its way into server-side implementations through Node.JS and Rhino, which detach the language from the incredibly complex Document Object Model.

Sidebar: JavaScript and Node.JS

The popularity of Node cannot be overstated. There is a rich set of new libraries created daily, from LDAP to Riak and all kinds of middleware to solve interesting problems. Node is merely a shim on top of JavaScript and the V8 VM which insulates the developer from some of the bad parts of the JavaScript language and confines them to a single-threaded model with local scoping, simplifying the design by only allowing state to be shared across an IPC barrier, similar to Erlang. Node is exceptional at dealing with network-based applications, supporting a variety of services such as HTTP, REST and WebSockets. Node.JS has become one of the most watched projects on GitHub and is gaining in popularity due to its rich but sometimes dangerous base language, JavaScript.

As an example, here is how simple it is to create a Web server in Node.

var http = require('http');

http.createServer(function (req, rsp) {
  rsp.writeHead(200, {'Content-Type': 'text/plain'});
  rsp.end('Hello World\n');
}).listen(8080, '127.0.0.1');

console.log('Server running at http://127.0.0.1:8080/');

I will reiterate the mantra of “having the right tool for the job”, and certainly Node, V8 and JavaScript have their place centered on I/O-demanding applications. The eventing model and single-threaded model are not for everyone, but they provide a powerful toolkit and an expressive language useful for a number of different use-cases.

Structured Web Programming, Intro to DART

JavaScript is gaining in popularity, primarily because of the dominance of web-based applications and the momentum behind HTML5 and CSS3 (helped along by Adobe dropping Flash for mobile), but JavaScript may not have all the answers, which is why Google is investing in DART.

Google may have realized that the divergence between native applications (i.e. ones that have to be coded separately for iOS and Android) and web-based applications was not going to be solved by JavaScript or the ECMA community, the stewards of the specification, so Google created DART.

DART is being created by a number of people within Google, most notably Gilad Bracha, who gave the keynote and a session on DART, and Lars Bak, lead developer of the Google V8 virtual machine.

Gilad has a long history in language development, including his work on Strongtalk and co-authoring the Java Language Specification.

DART is being built as a pure object oriented language with familiar constructs such as classes and interfaces and a type system that doesn’t interfere with developer productivity. In its current pre-release form, DART script is compiled into JavaScript to run on the V8 virtual machine. Eventually there will be a standalone DartVM (maybe part of Chromium), which will allow DART scripts to avoid this code generation step.

New languages take time to be nurtured and accepted into the mainstream, but what makes DART so promising is the rampant change in client-side development and device refreshes, which might give it a great leap forward if the OSS community adopts it.

Data Management

In the field of application science (if there were such a thing), data management would be the foundational pillar of writing applications that are reliable, flexible and engaging. In the post client/server era we are bombarded with new ways of storing, retrieving, protecting and manipulating data. The 451 Group has put together a very comprehensive look at this space, which incorporates the traditional relational database stores, the emerging NoSQL stores and the second-incarnation relational stores called NewSQL. Accompanying them is the growing and important data grid and caching tier.

Let’s look for a moment at NoSQL, since it is the hottest field in data management today.

In 1998 Carlo Strozzi used the term NoSQL to describe his new relational database system. In 2004, at the USENIX OSDI conference, Jeff Dean and Sanjay Ghemawat published their paper “MapReduce: Simplified Data Processing on Large Clusters”. It was the start of the industry’s changing viewpoint on data persistence and query processing. By 2007, database guru Michael Stonebraker had published his paper entitled “The End of an Architectural Era (It’s Time for a Complete Rewrite)”. In this paper he exposes the issues around relational database models and the SQL language and challenges the industry to build highly engineered systems that will provide a 50X improvement over current products. He does, however, ignore the impact of commodity scale-out distributed systems. And in 2009, Eric Evans used the term “NoSQL” at a conference to describe the emerging “non-relational, distributed data stores”.

These systems have grown from bespoke, highly guarded proprietary implementations to general-use and commercially backed products such as Hadoop. While the impact of implementations such as Hadoop, Cassandra, HBase, Voldemort, etc. has yet to be fully understood, every player from EMC to Arista Networks has banked on participating in this growing field, and “Big Data” has become the new buzzword in the industry.

SQL and Not Only SQL

Many large-scale shops are using and/or experimenting with a multitude of data management engines, including traditional databases such as MySQL, Postgres and BerkeleyDB, and are starting to experiment with NoSQL implementations. MySQL, Postgres and BDB are still very much in active use and are being used in new ways, e.g. MySQL as a graph database and BDB as a key/value store.

Sidebar: Facebook and MySQL

Facebook uses MySQL as a graph database. The data model represents the graph by associating each row with either a vertex or an edge. This data model is abstracted into their API so that application developers just deal with data in those terms; there are no indexes or joins in the data layer. In order to deal with disk bottlenecks they use a block device cache on the database servers to improve I/O, especially for random reads.
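
Facebook’s actual schema is not public, but the idea of storing a graph as flat vertex and edge rows, with traversal done in the API layer rather than via joins, can be sketched like this (all table shapes, field names and records here are invented for illustration):

```javascript
// Hypothetical "rows": one array of vertices, one of edges -- no joins below.
const vertices = [
  { id: 1, type: "user",  name: "alice" },
  { id: 2, type: "user",  name: "bob" },
  { id: 3, type: "photo", caption: "qcon" },
];

const edges = [
  { from: 1, to: 2, type: "friend" },
  { from: 1, to: 3, type: "likes" },
];

// The API layer resolves neighbors by scanning edge rows; a real system
// would front this with heavy caching rather than database-side joins.
function neighbors(id, edgeType) {
  return edges
    .filter(e => e.from === id && e.type === edgeType)
    .map(e => vertices.find(v => v.id === e.to));
}
```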

Hadoop, Cassandra, MongoDB, Riak

Each of these systems has benefits and tradeoffs depending on the use-case. Application designers need to understand how their application behaves: how write-oriented it is, how read-oriented, how dependent on transactions, and how sensitive to bandwidth and latency. The reality is that there is no one solution to all problems, and there is a lot of experimentation going on out there.

Hadoop has been a very successful project despite its shortcomings. The issue I will highlight with Hadoop relates to the memory management and garbage collection impact discussed above.

Instead I would like to focus on Cassandra. Cassandra combines the eventually consistent model of Dynamo with the data model of Google BigTable. It was created by Facebook and transferred to the Apache Foundation, where it continues to thrive with active participation and new customer adoption. At its core Cassandra is a four-to-five-dimensional key-value store with eventual consistency. You can choose to model it as:

  • Keyspace -> Column Family -> Row -> Column -> Value
  • Keyspace -> Super Column Family -> Row -> Super Column -> Column -> Value
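
In JavaScript-object terms, the first of these shapes can be pictured as nested maps (the keyspace, column family and column names below are invented; real Cassandra rows are sparse, so each row carries only the columns actually written to it):

```javascript
// Keyspace -> Column Family -> Row -> Column -> Value, as nested maps.
const keyspace = {
  users: {                        // column family
    "user:1001": {                // row key
      name: "alice",              // column -> value
      city: "sf",
    },
    "user:1002": {
      name: "bob",                // different rows may hold different columns
      email: "bob@example.com",
    },
  },
};

// A "read" navigates the dimensions in order.
const value = keyspace.users["user:1001"].name;
```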

Cassandra provides a number of benefits, including its flexible schema, multi-datacenter awareness, its ability to retrieve ordered ranges and high-performance distributed writes. Several years ago sites like Twitter were discussing Cassandra; now there is plenty of adoption, including a major push by NetFlix. (See Sid Anand’s presentation from NetFlix here: http://bit.ly/rtSnPR)

Sidebar: Netflix

Most people know that NetFlix stopped building out datacenters and instead chose full production support on Amazon EC2. Through their partnership, NetFlix has been instrumental in shaping Amazon services and improving their design to support high-performance multi-tenant infrastructure services. There is, however, a downside. Given the high I/O demand of some NetFlix processes (i.e. video codecs) they have to be very diligent about understanding the performance envelope of their system. Because Amazon is allowed to choose where VMs reside, there is a statistically significant chance that two high-demand users will wind up on the same physical gear. This has led some to do a burn-in process to indirectly determine whether another high-demand customer is sharing the system, which could cause instability or breach service levels. See Adrian Cockcroft’s presentation for more information on NetFlix and Amazon: http://bit.ly/ryjft8

Sidebar: A note on Riak

Riak is a distributed key/value store based in part on Amazon Dynamo. Like Cassandra, it uses an eventual consistency model, but one which allows the developer to make more discrete choices in the consistency/availability tradeoff: you can choose to accept writes even under partitioning events where you may not have quorum [AP], or you can enforce consistency by making quorum mandatory [CP]. Riak handles complex mechanisms such as vector clocks, Merkle trees, consistent hashing, read repair, hinted handoff and gossiping; these infrastructure services are highly complex and are hidden from the application developer behind a flexible API.
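
The quorum tradeoff in Dynamo-style stores can be stated numerically: with N replicas, a read quorum of R and a write quorum of W, a read is guaranteed to see the latest write only when R + W > N, because the read and write sets must then overlap. A toy check of that rule (this is the arithmetic, not Riak’s API):

```javascript
// Dynamo-style quorum rule: reads overlap writes only when r + w > n.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}

quorumOverlaps(3, 2, 2); // true  -- strict quorum, consistency-leaning [CP]
quorumOverlaps(3, 1, 1); // false -- fast but eventually consistent [AP]
```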

Riak is written in Erlang, which provides a highly reliable foundation for building distributed services. Erlang is designed to be highly robust in the face of failure and leverages a number of techniques to maintain reliability. But there is a price for writing in Erlang’s functional, event-driven style: it is either loved or hated. Still, the functional model and actor patterns it embraces have become a reference point for new languages (see DART’s isolates). Riak’s presentation can be read here: http://bit.ly/vcoX5r.

NewSQL, a blast from the past

So what in the world is NewSQL? NewSQL is the second incarnation of relational datastores, which may have more in common with Strozzi’s original use of the term NoSQL. Essentially these database engines provide the same DB semantics, such as indexes, foreign keys and stored procedures, and are geared towards OLTP workloads that require ACID-based consistency. They borrow a number of techniques found in NoSQL datastores, such as sharding and reconciliation, but are designed to be compatible with traditional DDL and SQL-based language constructs. NewSQL implementations such as VoltDB and Drizzle will play an increasing role, as these types of datastores will continue to live on. See Ryan Betts’ presentation here: http://bit.ly/vPHval.

Reliability at Scale

Complex systems can trigger bottlenecks and exhibit near-linear falloff as side effects dominate interactions. Facebook’s Robert Johnson gave a great talk on how they use their fail-fast development process to reduce failures at scale. See Robert’s presentation here: http://goo.gl/xAqmz

Facebook has a unique position on dealing with scale: with over 800M subscribers, they process more requests in an hour than most Fortune 500 companies do in a month. For Facebook, scalability is intrinsically linked to reliability (i.e. dealing with failures). “The goal isn’t to avoid mistakes, it’s to make the cost of mistakes low”. With an aggressive approach to moving things fast, they reduce the convergence time for failure, allowing them to be more consistent (sort of a high-pass filter). By making frequent small changes, they reduce the surface area, allowing them to quickly skirt around problems without having a large impact. They have honed this methodology to a craft, allowing them not only to fail fast but also to encourage developers to try new things. Their goal is to run a new experiment every day, which allows them to continue to make the site engaging.

Measurement and Monitoring

Critical to all well-run services is the proper use of instrumentation and monitoring. On the Internet, and more specifically in the cloud, it is even more important, as systems typically consist of hundreds if not thousands of independent services tied together to deliver a user-facing service. A theme that emerged throughout most of the application architecture talks was to instrument the heck out of everything. printf-style logging (where the developer inserts log statements into the codebase), GC logging and application logs (e.g. webserver logs) are all necessary to trace back through the system at the time of a failure. In these systems coordinated failures (failures which coincide across many services) become more prominent, and latency issues can be hidden in the inter-service calls, requiring a massive amount of information. Some shops have hundreds of machines dedicated to instrumentation, logging and monitoring.
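
As a minimal illustration of "instrument everything", a service call can be wrapped so that every invocation emits a latency record without touching the call sites (the function and metric names here are invented for the sketch):

```javascript
// Wrap any function so each call logs its name and elapsed wall-clock time.
// `log` is injectable so records can go to console, a file, or a collector.
function instrumented(name, fn, log = console.log) {
  return function (...args) {
    const start = Date.now();
    try {
      return fn(...args);
    } finally {
      // Emitted even when fn throws, so failures are timed too.
      log(`${name} took ${Date.now() - start}ms`);
    }
  };
}

const lookup = instrumented("userLookup", id => ({ id, name: "alice" }));
const user = lookup(42); // also emits a "userLookup took ...ms" record
```

In a real system the log sink would batch records to a metrics pipeline; the wrapper pattern stays the same.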

Sidebar: Disaster Porn and Joyent

One of the most interesting talks was by Bryan Cantrill from Joyent. Bryan is one of the original developers of DTrace at Sun Microsystems and now runs one of the largest competitors to Amazon EC2. DTrace is an expressive debugging tool built into Solaris; it is also incorporated in Illumos, the open source derivative of Solaris, and is a core component of MacOS. DTrace allows you to do some very sophisticated things, including filtering the call stack running across every core and embedding references in dynamic languages such as JavaScript to reconcile address pointers to libraries and even to source code. Bryan demonstrated some wizardry in analyzing a core dump of a crashed Node.js application, which was simply unbelievable. It means that the pains of troubleshooting stack traces of managed languages such as Python, Ruby and JavaScript are within reach (as long as you use Illumos). See Bryan’s presentation here: http://bit.ly/rGfKTP

The Simian Army

So how do you deal with a business that is tied together from hundreds if not thousands of services, each having some combinatorial effect on the others, yet so distributed that understanding what is happening at any point in time is almost impossible? Call in the monkeys!!

Chaos Monkey

 “The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”[2]

What started as a methodology to literally create chaos, the worst-case situation for operations, has turned into a methodology for coping with the unknown. Keeping customers happy requires understanding a range of complex interactions, from the impact of a simple server failure to more exotic memory corruption errors and race conditions.

As noted earlier, modern applications are designed to be both modularized and resilient in the face of catastrophe. In a similar vein to the way packet networks were designed to continue operating on the loss of a packet, modern application services are isolated from one another so as to avoid a mass cascading failure which would impact the entire service. In a technique similar to the way security researchers fuzz applications in search of memory corruption vulnerabilities, the Simian Army is put to work to challenge the best developers to keep their systems running in the face of utter chaos.

Latency Monkey

Hard failures are one thing, but soft failures such as service degradation are harder to spot before users start to feel something is wrong. These monkeys inject random delays in client-side and server-side processing to test the ability of applications to detect and recover in a graceful way. These tests might expose more complex issues such as thundering herds and timeouts.
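
In the spirit of the Latency Monkey (this is an illustrative sketch, not Netflix’s implementation; the names and defaults are invented), latency injection can be as simple as a wrapper that occasionally delays a service call before letting it proceed, so clients’ timeout and retry handling gets exercised:

```javascript
// Wrap an async service call: with some probability, inject a random delay
// before the real call runs, exercising timeout/retry paths downstream.
function withRandomDelay(fn, { probability = 0.1, maxDelayMs = 2000 } = {}) {
  return async function (...args) {
    if (Math.random() < probability) {
      const delayMs = Math.random() * maxDelayMs;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
    return fn(...args);
  };
}

// Half of all calls to flakyPing now arrive "slow".
const flakyPing = withRandomDelay(async () => "pong", { probability: 0.5 });
```

The correctness property is that the wrapper changes only timing, never the result, which is exactly what makes latent timeout bugs visible.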

The Role of the Network

The network is a critical component, especially as designers gravitate towards distributed services. Whether you are dealing with two-phase commit or quorum-based consistency models, the network plays a major role in application design. Application architects must make clear tradeoffs when it comes to consistency and availability under network partitioning events and must pay close attention to how network flow and error control are impacting the application.

During Facebook’s presentation they made a point of showing the community a slide on this topic when talking about dealing with scalability.

Incast[3][4] is a well-documented phenomenon, but solutions vary from engineering larger buffers to high-resolution TCP timers. The impact for applications with coordinated “synchronization” is a complete falloff in network throughput for hundreds of milliseconds. There is also another phenomenon called “port blackout” or “outcast”[5], where switches with tail-drop queues exhibit unfairness due to RTT bias; the effect is that short flows with small RTT can be starved by longer flows with larger RTT. Given the sacrifices that have to be made under these and other conditions, application developers struggle with prioritizing communications. For critical writes the network must have capacity to commit in a consistent way, while non-crucial services such as bookkeeping writes should be relegated to a best-effort service. Today application architects solve these problems in a number of ways, from moving to UDP and building error and flow control into the application layer to providing rate limiters within the application API itself.
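
One common shape for the in-API rate limiter mentioned above is a token bucket: allow a burst up to some capacity, then refill permission at a steady rate. A minimal sketch (class and parameter names are invented; the injectable clock is only there to make the behavior testable):

```javascript
// Minimal token bucket: permit bursts of up to `capacity` requests,
// refilled continuously at `ratePerSec` tokens per second.
class TokenBucket {
  constructor(capacity, ratePerSec, now = () => Date.now()) {
    this.capacity = capacity;
    this.ratePerSec = ratePerSec;
    this.tokens = capacity;   // start full: an initial burst is allowed
    this.now = now;
    this.last = now();
  }
  tryAcquire() {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // request may proceed
    }
    return false;   // caller should drop, queue, or back off
  }
}
```

An application API would call `tryAcquire()` before issuing a non-critical write, shedding load instead of letting bookkeeping traffic compete with critical commits.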

Network designers try to solve these problems with larger buffers (which can cause instabilities of their own, because the error control channel is caught up in data transfer PDUs), with active queue management such as RED (although it is algorithmically flawed), and even with stochastic drop queues (i.e. random instead of tail-drop).


Application development continues to go through a massive amount of change. We have not discussed some of the practices around Agile and Lean software development, technical debt, and all of the work going into the client side such as HTML5 and CSS. We are seeing an increasing role in dealing with infrastructures at scale and redefining the proper models and patterns for continuously reengineering to grow the business. We all know that Moore’s Law does not track the improvements in performance, and our view of time-shared systems doesn’t reflect the combinatorial effects of O(N^2) application interactions. Hypervisor-based clouds are also becoming problematic for some. A great talk by Erik Onnen from Urban Airship covered the difficulties of maintaining their business on EC2, including:

  • Erratic network latency
  • Strange network behavior
  • Kernel hypervisor issue
  • Undocumented limitations
  • Database scaling

This should not be a big surprise; after all, hypervisor-based systems are based on multi-tiered time sharing, which can be highly volatile, especially for data-intensive applications. Erik’s presentation can be seen here: http://bit.ly/sHZFpY.

As application developers learn to improve their use of memory and are given better interfaces to manage communications and quality of service, it may be possible to solve some of these challenges. Certainly there is no turning back; these problems need to be fully understood, and plans need to be in place, so businesses can continue to use the Internet to grow, put people to work to strengthen world economies and allow innovation to transcend individual expectations to make a better world for tomorrow.

[1] http://en.wikipedia.org/wiki/Type_system
[2] http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
[3] http://dl.acm.org/citation.cfm?id=1049998
[4] http://www.pdl.cmu.edu/PDL-FTP/Storage/CMU-PDL-09-101.pdf