The first in a series of hands-on articles by Allen Holub about programming for the public cloud
A friend recently reported a conversation he had with one of those wide-eyed, gee-golly developers who’s half Techie and half Moonie. When asked what he was working on, the speaker came back with, “cloud cloud cloud cloud cloud cloud,” and my friend said, “but, what if …,” to which the speaker replied, “cloud cloud cloud cloud cloud,” to which my friend said, “but that won’t work because…,” to which the developer responded, “cloud cloud cloud cloud cloud” — and so it went. Many use “cloud” as a synonym for “good.” Cloud architectures indeed have a lot going for them, but they’re not a panacea, and you need to know what you’re doing to jump to the cloud successfully.
This article discusses general cloud-related issues and looks specifically at the Amazon and Google cloud architectures. Subsequent articles will be more practical, delving deeply into the code that comprises cloud-based applications, but let’s start with some background.
What Is The Cloud?
First, what exactly does “cloud computing” mean? The term “cloud” dates to the early days of the Internet, back before domains existed (yes, there was such a time). An email address was essentially a route specified in what was called “bang notation.” To send an email, you needed to know the name of every machine between you and the recipient. Here’s a particularly nasty example that I pulled out of an old newsgroup post:
dog.ee.lbl.gov!ucbvax!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!utzoo!sq!msb
The “dog” and “msb” are the sending and receiving machines. The rest is the route from one machine to the other. The routes didn’t have to be fully specified — most email systems knew about a handful of major hubs, so a minimum-length address just specified a route from that hub to you — but there was zero flexibility.
Things changed with the introduction of domains. Instead of an explicit route, you sent an email to a gateway machine, and the email’s recipient got the mail from a different gateway machine. The servers through which the mail passed on the way from one gateway to the other were anonymous, and the network topology was unknown. The word “cloud” was coined to describe that amorphous network. You didn’t know what went on inside the cloud (and believe me, you didn’t want to). As long as the mail ended up at the right place, everything was copacetic.
So, here’s my rather strict definition of “cloud”: a network of computers, arranged in an unknown topology, about which you need to know nothing except how to talk to a machine at the edge.
A “cloud application” is then an application deployed to the cloud itself, not to a specific machine. The application could be running on one or more machines that may or may not be physically collocated. Its data store could also be distributed, and may not be on the same machine as the application.
In my own mind, I see a difference between cloud applications and Web 2.0 or AJAX applications, which we used to call “thin client” applications. Here, the UI is a standalone program (typically written in JavaScript and running in the browser), and it talks to a server (which is primarily a data repository) using HTTP.
Web 2.0 applications are typically implemented using a traditional client-server architecture (one server hosted on an ISP talking to multiple browser-based clients). However, there’s no reason why you can’t have a Web 2.0 cloud application. In fact, that’s most likely the way all applications will work five years from now. It’s useful, however, to separate the concepts in your head. Most current Web 2.0 applications are not cloud based.
What Difference Does It Make?
So, why would you want a cloud application instead of a simple client/server arrangement? Consider the following ping times:
> ping www.google.cn
PING www.google.cn (203.208.37.99): 56 data bytes
64 bytes from 203.208.37.99: icmp_seq=0 ttl=239 time=273.340 ms
64 bytes from 203.208.37.99: icmp_seq=1 ttl=239 time=478.394 ms
64 bytes from 203.208.37.99: icmp_seq=2 ttl=239 time=421.920 ms
64 bytes from 203.208.37.99: icmp_seq=3 ttl=239 time=343.003 ms
64 bytes from 203.208.37.99: icmp_seq=4 ttl=239 time=263.843 ms
64 bytes from 203.208.37.99: icmp_seq=5 ttl=239 time=482.231 ms
…
The round-trip time between my desk in Berkeley, California, and one of Google’s servers in Hong Kong ranges from a bit over a quarter to almost half a second, and we’re sending only 56 bytes. There’s essentially no server overhead, but we’re hostage to both distance and the speed of the routers through which the data is passing. (A tracert reports only 17 hops, so the latency is probably all distance.) The picture is different when the server is close by. Here are the results from Berkeley to San Jose (12 hops):
> ping google.com
PING google.com (74.125.224.52): 56 data bytes
64 bytes from 74.125.224.52: icmp_seq=0 ttl=54 time=19.815 ms
64 bytes from 74.125.224.52: icmp_seq=1 ttl=54 time=20.466 ms
64 bytes from 74.125.224.52: icmp_seq=2 ttl=54 time=35.547 ms
…
A cloud application (or at least the instance of the application that we’re talking to) would ideally be running on the machine with the best access time. That’s the main advantage — the cloud can effectively reconfigure itself to take care of pesky details like network latency.
However, there’s no way to guarantee that this reconfiguration will actually happen, which brings us to the dark underbelly of a cloud app: We need to program for the worst case.
Imagine a cloud app that’s doing some kind of word completion. Every time you type a character, it’s sent off to a server, which finds words prefixed with whatever you’ve typed. The server sends back a list of possible matches, and your program displays these. Most of the cloud books, in fact, demonstrate this sort of thing in exactly that way — the local application talks to the server with literally every keystroke. Given the look-up times, etc., your user isn’t going to be particularly happy with your worst-case response time. You can, however, rethink your strategy. When the first few characters are typed, the server could send you a large, perhaps exhaustive, list of every word that could possibly start with those characters. Thereafter, the application can use that list to update its display rather than going back to the server with every key press. By eliminating the redundant network queries, we make the application much more responsive.
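To make that concrete, here’s a minimal Java sketch of the client-side caching strategy. The class and the fetchWordsFromServer() call are hypothetical (a real implementation would issue the request asynchronously over HTTP), but the division of labor is the point: one round trip per prefix, and every subsequent keystroke is handled locally.

import java.util.ArrayList;
import java.util.List;

// A sketch of the prefix-caching strategy described above. The names
// (CompletionCache, fetchWordsFromServer) are hypothetical.
public class CompletionCache {
    private static final int PREFIX_LENGTH = 2; // go to the server only for the first two characters

    private String cachedPrefix;                           // prefix that the cached list covers
    private List<String> cachedWords = new ArrayList<String>();

    // Return the completions for whatever the user has typed so far.
    public List<String> complete(String typed) {
        if (typed.length() < PREFIX_LENGTH)
            return new ArrayList<String>();                // too short to be worth a round trip

        String prefix = typed.substring(0, PREFIX_LENGTH);
        if (!prefix.equals(cachedPrefix)) {
            cachedWords = fetchWordsFromServer(prefix);    // the one network query
            cachedPrefix = prefix;
        }
        List<String> matches = new ArrayList<String>();    // everything else is local
        for (String word : cachedWords)
            if (word.startsWith(typed))
                matches.add(word);
        return matches;
    }

    // Placeholder for the actual HTTP request to the completion server.
    private List<String> fetchWordsFromServer(String prefix) {
        return new ArrayList<String>();
    }
}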
On the plus side, the cloud is amorphous and can indeed reconfigure itself based on observed load. If Google notices that there is a lot of traffic between Berkeley and Hong Kong, it may well replicate the Hong Kong server somewhere in California, and the latency would suddenly improve. The same applies to your cloud application: It will, ideally, be running on several geographically distributed servers, with the topology scaling to accommodate actual requests. In other words, the size of the network (and your cloud-services provider) matters. For cloud services to be effective, the provider has to be large. If you deploy to the Google or Amazon infrastructure, you’re effectively leveraging the flexibility inherent in a very large network. By my rather strict definition, an application running on a single server, whether it’s an ISP or so-called cloud host, isn’t a cloud application at all because it loses the scalability and flexible topology of a true cloud infrastructure.
Scalability and Cloud-Service Architecture
A significant advantage of a cloud infrastructure is automatic scalability, but here’s one place where the basic architecture matters. Amazon’s Elastic Compute Cloud (EC2), like most cloud offerings, rents you a “virtual machine” to host your application. Your VM may or may not share a physical machine with other apps, and it has an unknown number of physical processors attached to it. At its heart, though, your EC2 VM is just a Linux (or Windows) box, and you can configure it however you want. You typically pay only for the time that your VM is actually running, which is great for a software startup that’s effectively getting rack space for free. As the load increases, so do your expenses (but hopefully, so does your revenue). You use your VM pretty much the same way you’d use a shell login to a shared server at an ISP, deploying with FTP, etc.
The inherent flexibility of a hosted-VM approach is particularly important the day that your application gets reviewed in The New York Times, and suddenly you have 1000 hits/second. Your ISP-hosted shared server would just crash at this point. An EC2 VM will scale, however, running on a dedicated machine if need be, with cores added as necessary. Amazon will automatically increase the “oomph” of your VM — giving you more machine cycles on the physical machine, for example, or assigning more cores to your application. Of course, you’ll pay for this extra oomph.
The main downside of this approach is that there is an effective upper limit on the scalability. Adding cores can get you only so far, and there’s a diminishing return on the number of cores. Eventually, you’re using everything that the machine can give you. What if that’s still not enough to handle the volume? In theory, your app can be placed on several machines at this juncture, with Amazon handling the load balancing (you can run several EC2 VMs in separate physical locations that you specify), but that scaling doesn’t happen automatically, and the app has to be written with scaling in mind.
That is, if you’re really planning on scaling, you have to do exactly the same amount of programming work that you’d do if you were running the application on multiple machines in your own data center, and that’s a nontrivial amount of work. To sum up, Amazon and its brethren give you a lot of flexibility in configuration: You can put anything you want on your virtual Linux box, write your app in any language, and augment it with custom processes; go crazy! The downside is that you have to worry about administration and scaling, which can add to the complexity (and cost) of the application very quickly.
Fortunately, there is another approach — the one used by the Google “App Engine.” Google doesn’t rent you a VM at all — you have no control of the operating system and can’t install arbitrary applications on “your” machine. Instead, you rent time on a virtual application server (think Tomcat). You write your application in an approved language (Python or Java) and deploy it directly to Google’s app server, not to the operating system. For example, if you’re using Java, your application is a standard Java “web app” packaged into a WAR file and deployed to Google exactly the same way that you’d deploy to a Tomcat instance: by uploading the WAR. Google handles the Tomcat part. (It’s not actually using Tomcat, but I usually test locally using Tomcat and haven’t found any problems. Google’s own development tools use Jetty.) Part of deploying the app is telling Google what URL to use to access it. You can use either a Google-provided URL (something.appspot.com) or a subdomain of your own domain.
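To give you a feel for how little is involved, here’s a minimal, hypothetical servlet of the sort you’d package into that WAR. There’s nothing App Engine specific about it; the same class deploys unchanged to a local Tomcat instance.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A garden-variety servlet. Map it to a URL in web.xml, package the result
// as a WAR, and it deploys to App Engine just as it would to Tomcat.
public class HelloServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/plain");
        response.getWriter().println("Hello from the cloud");
    }
}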
I personally prefer the Google approach for several reasons. First, I hate doing system-administration work. Since Amazon just gives me a box with an OS on it, I’m forced into that role if I use EC2. Google, on the other hand, does all the SA work for me. All I need to worry about is my application. Second, Google’s virtual application server can, itself, scale and take my application along with it. For example, the app server could, at least in theory, run on multiple machines simultaneously in the same way that you can cluster Tomcat instances. Scaling, then, is in no way limited by the number of cores or speed of a single machine. As a consequence, my application can be vastly simpler, since I don’t have to deal with the scalability issues in the source code. Finally, Google provides a rich set of development tools (mostly Eclipse extensions) that ease development considerably, though many of those tools will work with EC2 as well. For example, the Google Web Toolkit (GWT) provides you with a way to build a browser-agnostic AJAX front end in Java. GWT includes a Java-to-JavaScript compiler that translates your code into platform-independent JavaScript when it’s time to deploy — but when you’re developing, it’s all Java. That means that you can use the Eclipse debugger on both the client and server side, trace execution from client to server, etc., all within a single development environment. GWT applications will even run fine under Tomcat on an EC2 instance. I can develop much faster with GWT than I ever could when I was writing JavaScript by hand.
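For a taste of what that looks like, here’s a minimal, hypothetical GWT entry point. It’s ordinary Java while you’re developing and debugging; the GWT compiler turns it into the cross-browser JavaScript that actually runs in the browser.

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.user.client.Window;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.RootPanel;

// The module's entry point: the browser-side UI, written entirely in Java.
public class HelloModule implements EntryPoint {
    public void onModuleLoad() {
        Button button = new Button("Say hello");
        button.addClickHandler(new ClickHandler() {
            public void onClick(ClickEvent event) {
                Window.alert("Hello from Java-compiled JavaScript");
            }
        });
        RootPanel.get().add(button);
    }
}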
On the downside (to paraphrase Henry Ford): Your app can come in any color, provided that it’s black. Your choice of implementation language is effectively Java (my own predilections preclude writing an enterprise application in Python). You have to structure your application as a Java web application, built around servlets, and you have to access your data using JDO or JPA (there’s no JDBC support).
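As an example of what that data access looks like, here’s a sketch of a persistent class using the standard App Engine JDO annotations (the Customer class itself is hypothetical):

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;
import com.google.appengine.api.datastore.Key;

// A JDO-annotated entity. App Engine stores it in its distributed
// datastore; no relational table (and no JDBC) is involved.
@PersistenceCapable
public class Customer {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key key;          // the datastore assigns the key on first save

    @Persistent
    private String name;

    public Customer(String name) { this.name = name; }

    public Key    getKey()  { return key;  }
    public String getName() { return name; }
}

You store an instance with an ordinary JDO PersistenceManager (pm.makePersistent(new Customer("Fred"))). The annotations will look familiar if you’ve used JPA, but the store underneath isn’t relational, so joins and other relational operations are restricted.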
Google’s working on adding SQL (due to be released within the next few months), but it’s not there yet, and it will be available only to “App Engine For Business” customers. Unfortunately, Google’s pricing model for the “For Business” customers effectively makes SQL inaccessible to a standard web application meant as a public Software-as-a-Service (SaaS) app. Google charges $8 per year per user for a “For Business” application, which makes sense if you’re implementing your HR application on Google instead of running it in your own data center. But a per-user fee is nonsensical if you’re writing a SaaS app to expose to the entire Internet. The standard (not “For Business”) App Engine charges are based on CPU and data usage, not the per-user model. Google has made similarly stupid (a technical term we analysts use) choices on other fronts as well. For example, a standard App Engine application can use SSL only if you deploy the page to a Google URL (MyDomain.appspot.com), which could be disconcerting to one of your users if they look at the address bar. Similarly, though you can host subdomains on the Google App Engine, you cannot host your main domain on Google. You have to get an account with a standard ISP, and then redirect access to a Google-hosted subdomain. (For what it’s worth, www.foo.com is a subdomain, so it can be hosted on Google. It’s the foo.com, without the www, that’s the problem.)
Amazon EC2, on the other hand, gives you several database choices: You can run an RDBMS on your VM, you can use Amazon’s Relational Database Service (RDS), or you can use Amazon’s SimpleDB service if you’re doing something very simple. You can easily host your domain on an EC2 instance, and you can easily access that domain using SSL (because you’re just accessing your own instance of Apache, running on the VM).
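To make the contrast concrete, here’s a hypothetical fragment of garden-variety JDBC (the driver, host name, and credentials are placeholders). On EC2 this works against a MySQL instance running on your own VM or against an RDS endpoint; on the standard App Engine, there’s simply nothing for it to connect to.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Ordinary JDBC against a MySQL database, which on EC2 could live on
// your own VM or behind an Amazon RDS endpoint.
public class CustomerReport {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");  // pre-JDBC-4 driver loading
        Connection connection = DriverManager.getConnection(
            "jdbc:mysql://db.example.com:3306/mydb", "user", "password");
        try {
            Statement statement = connection.createStatement();
            ResultSet results = statement.executeQuery("SELECT name FROM customers");
            while (results.next())
                System.out.println(results.getString("name"));
        } finally {
            connection.close();
        }
    }
}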
So, to sum up the differences before moving on to other issues: Google provides a better programming environment, with easy deployment and very good scalability; but Google’s services are marred by an inability to easily host your top-level domain, an inability to use SSL with your own domain’s URL, and a lack of SQL support. The last two can be resolved if you’re an “App Engine For Business” user, but the pricing model for that service effectively makes it useful only for large companies that want to move in-house applications from their own data centers to Google, something that I have a hard time believing will happen. Amazon has none of those particular problems, but system administration is difficult with EC2, and scalability is not fully automated. It’s the scalability issue that’s the showstopper for me, so I’m using the Google App Engine in spite of its limitations.