Analysis of IT news

Saturday, September 13, 2008

On parallelization

A big buzzword these days is "cloud computing" / "grid computing": the concept of spreading a heavy computing workload over a large network of computers.

The poster child for that trend is of course Google. Back in the day, Digital's AltaVista was one of the top web search services around, powered by a handful of DEC Alphas - Digital's then top-of-the-line, 64-bit Unix machines. Google took the radically opposite approach: instead of using a few powerful computers, it used a network of thousands of ordinary PCs.

And several others followed suit. In 1999 the SETI@home project started using spare CPU cycles from volunteers' computers all over the world to crunch numbers in the hope of finding signs of extra-terrestrial intelligence in outer-space noise. Following the same model, Stanford University's Folding@Home molecular simulation program now largely relies on people volunteering the spare processing power of their PlayStation 3s. And these days, Amazon is getting more and more known for its cloud computing offering.

Heavy parallelization is gaining ground. Move over, Crays!

But if you look closely, supercomputers like Crays *are* all about parallelization. A Cray doesn't contain one super-duper mega CPU; it is composed of several racks, themselves composed of several boards, each of which contains several CPUs running in parallel. Modern supercomputers can be composed of over 70,000 quad-core CPUs. Running just any application on a Cray won't magically make it work at top speed: it needs to be vectorized first so that several parts can run in parallel.
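The idea that work must be split before parallel hardware can help can be sketched in a few lines. Here is a toy illustration (not Cray vectorization, just the same principle on a multi-core PC): a sum of squares is chopped into independent chunks that worker processes handle in parallel.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes its slice independently of the others.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    step = len(data) // n_workers
    # The "vectorization" step: split the workload into parallel parts.
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```

An application that can't be split this way gains nothing from the extra CPUs, which is exactly the point made above.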

The same parallelization logic is standard in the hard disk industry: there is no such thing as a multi-terabyte hard disk - you just use several hard disks in parallel, which is what RAID is all about. And even in the PC world parallelization is nothing new. Back in the early '90s, the novelty of the Pentium was its superscalar design: two pipelines executing instructions in parallel. And nowadays you can find desktop computers with dual- or quad-core processors. No matter where you look, high-performance solutions will always include some type of parallelization, because that's the easiest way to push the performance envelope.
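The RAID idea mentioned above can be sketched very simply. This toy model shows RAID 0 striping: consecutive stripes of data go to consecutive disks round-robin, so reads and writes hit several disks in parallel (illustrative only, not a real RAID implementation).

```python
def stripe_write(data, disks, stripe_size):
    # RAID 0: stripe i goes to disk (i mod number_of_disks), round-robin.
    for i in range(0, len(data), stripe_size):
        disk = (i // stripe_size) % len(disks)
        disks[disk].append(data[i:i + stripe_size])

disks = [[], [], [], []]          # four "disks"
stripe_write(b"abcdefghijklmnop", disks, stripe_size=2)
# disk 0 now holds stripes b"ab" and b"ij", disk 1 holds b"cd" and b"kl", etc.
```

Four disks striped this way offer roughly four times the capacity and throughput of one, which is the same "many cheap units in parallel" logic as the rest of the post.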

So what's the difference between a cloud computing solution and a supercomputer then?

One of the main differences is how parallelization is done. In supercomputers everything is tightly integrated and highly customized. Where Google and SETI@home have to connect their machines through a network (or even the Internet for the latter!), a Cray connects its CPUs through a customized high-speed bus.

Another difference is that a Cray will use top-of-the-line CPUs. Google, on the other hand, for a while even chose Intel Pentium IIIs over Pentium 4s because the former offered more processing power per watt (heat is a big concern, whether for computer farms or for supercomputers). And as far as SETI@home is concerned, well, beggars can't be choosers and it has to deal with whatever computers volunteers have.

So far, those two differences play in the supercomputer's favor. Except for two factors:
1) The proponents of cloud computing had to work around the integration hurdle, which eventually gave them an edge. Google had to figure out how to distribute its workload effectively across thousands of PCs - they even had to customize some hardware to achieve this goal. In SETI@home's case, each computer is assigned a work unit to process, so task splitting is easy. But once this is achieved, the architecture is much more scalable. Google can easily add more computers to its data centers, whereas a Cray box would need to be redesigned to allow twice as many CPUs as it can currently hold.

2) Using off-the-shelf products such as PCs or PlayStations lets you benefit from the huge economies of scale those products enjoy. The CPU inside a PC might not be as powerful as the CPU used inside a Cray, but on a per-dollar basis it offers much more computing power. Likewise, the rise of 3D video games has driven demand for high-performance 3D graphics cards. Because the video game market is so large, the GPUs (Graphics Processing Units) used for video games are now very powerful number crunchers for a very modest price - the downside is that they cannot perform every type of operation, so their use for supercomputing is still limited.
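The work-unit model described in point 1 is worth a quick sketch: a coordinator chops the dataset into independent units, hands them out, and merges the results in whatever order they come back. This is a toy illustration of the idea, not SETI@home's actual protocol.

```python
from queue import Queue

def make_work_units(data, unit_size):
    # Split the whole dataset into independent work units.
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]

def volunteer(unit):
    # A volunteer machine processes one unit in isolation; here the
    # "analysis" is just counting samples above a threshold.
    return sum(1 for x in unit if x > 0.5)

def coordinate(data, unit_size=1000):
    pending = Queue()
    for unit in make_work_units(data, unit_size):
        pending.put(unit)
    total = 0
    while not pending.empty():
        total += volunteer(pending.get())  # results can arrive in any order
    return total
```

Because no unit depends on any other, adding volunteers scales the throughput almost linearly - which is why the architecture is so much easier to grow than a tightly integrated machine.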

So does this mean that cloud computing will replace supercomputers? Not exactly. If you look closely you will see they live in different markets. Cloud/grid computing has caught on mainly in two markets:
- Web applications that require a scalable architecture to sustain heavy Internet traffic
- Entities with high number-crunching needs but low budgets, which rely on spare cycles from volunteers all over the Internet.

Supercomputers, on the other hand, are all about teraFLOPS (trillions of floating-point operations per second). Since the early days of the ENIAC they've always been number crunchers and don't run complex applications. Supercomputers serve a very small but prized niche market: deep-pocketed, high-end customers who need as much number-crunching power as money can buy. The Pentagon would be a prime example.

Now, how does cloud computing compare to supercomputing?

In 2007 the fastest supercomputer was the IBM BlueGene/L, which can be configured to deliver up to 478.2 TFLOPS. Its successor, the BlueGene/P, is designed to break the petaFLOPS barrier (a thousand trillion floating-point operations per second). But as of the end of 2007 the record is held by a grid computing solution: Stanford's Folding@Home project reached 1.3 petaFLOPS in September 2007, of which about one petaFLOPS was contributed by over 670,000 PlayStation 3s.

Now, what about the price? At discounted prices, purchasing 670,000 PlayStation 3s would cost over a hundred million dollars. By comparison, reaching 1 petaFLOPS with an IBM BlueGene/P would require 72 racks. At $1.4 million per rack we're talking about a $100 million price tag.

So cloud/grid computing delivers price and power comparable to what supercomputers provide. The downside of the former is the huge number of computers involved. Stanford might not mind not owning its computing power, but some customers, like the Pentagon, do. And hosting 670,000 PS3s would require a lot of space and consume a lot of electricity, so a supercomputer is more practical. On the other hand, the Folding@Home project is grossly under-using the hardware: each individual PS3 can deliver a theoretical 2 teraFLOPS, a far cry from the 1.5 gigaFLOPS actually delivered (1 petaFLOPS divided by 670,000).
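The back-of-the-envelope numbers above (the ~1 petaFLOPS and 670,000 PS3 figures are the ones cited in this post) can be checked in a few lines:

```python
# Figures cited above: ~1 petaFLOPS delivered by ~670,000 PS3s.
PETAFLOPS = 1e15          # floating-point operations per second
ps3_count = 670_000
ps3_peak = 2e12           # theoretical peak per PS3, in FLOPS

delivered_per_ps3 = PETAFLOPS / ps3_count    # actual FLOPS per console
utilization = delivered_per_ps3 / ps3_peak   # fraction of peak actually used

print(delivered_per_ps3 / 1e9)   # ~1.49 gigaFLOPS per PS3
print(utilization * 100)         # well under 0.1% of theoretical peak
```

In other words, each console delivers less than a thousandth of its theoretical peak, which is the gross under-use the paragraph above describes.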

So here are the trends that I can see:
- cloud computing will evolve into supercomputers
- they will however keep their off-the-shelf CPUs
- grid computing will stay in its niche market

1) Cloud computing will evolve into supercomputers. The big downside of cloud computing is the space and energy it requires. But if you look at the trends you will see tighter and tighter integration, in an effort to save mostly space but also energy, and to gain processing power. At first PCs were a big box on the desktop. As servers they were initially the same big box in the server room. Then they became available as racks, i.e. pizza boxes that you stack on top of each other in a tower rack, saving a lot of space. Google customized the hardware some more to transform an army of racks into a cloud computer. Because they had to deploy data centers all over the world, they learnt to "package" those data centers to be as easy to deploy as possible. And they're not alone: Sun now sells a "modular datacenter", a ready-to-use cargo container packed with servers that you just plug into a company's network and power supply. Last but not least, graphics processor manufacturer Nvidia has started selling its own sort of small supercomputer, the Nvidia Tesla, a rack-mounted appliance containing 4 GeForce 8800GTX graphics cards that delivers an alleged 2 teraFLOPS for $12,000. It's still far from the BlueGene, but it's a start, and it costs only $6,000 per teraFLOPS (compared to $100,000 per teraFLOPS for the BlueGene/P).
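The price-per-teraFLOPS comparison above works out as follows (using only the figures cited in this post):

```python
# Nvidia Tesla appliance: $12,000 for an alleged 2 teraFLOPS.
tesla_price, tesla_tflops = 12_000, 2
# BlueGene/P: 72 racks at $1.4M each for ~1 petaFLOPS (= 1000 teraFLOPS).
bluegene_rack_price, racks_for_petaflops = 1_400_000, 72

tesla_per_tflop = tesla_price / tesla_tflops
bluegene_total = bluegene_rack_price * racks_for_petaflops
bluegene_per_tflop = bluegene_total / 1000

print(tesla_per_tflop)      # $6,000 per teraFLOPS
print(bluegene_per_tflop)   # ~$100,800 per teraFLOPS
```

On these numbers the commodity-GPU appliance undercuts the custom supercomputer by a factor of roughly 17 per teraFLOPS, which is the economies-of-scale argument in a nutshell.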

Cloud computing won't turn into supercomputers overnight, but it will likely be based on machines that aggregate more and more processors. This will start with using quad-core or dual quad-core machines, if it isn't already done. 8- or 16-core CPUs are just a matter of time. Later on they might use a network of Nvidia Teslas or other computers containing a few hundred processors. And so on and so forth. Eventually cloud computing might look just like a supercomputer: composed of several racks, each composed of several boards.

Cloud computing solutions based on GPUs seem to be designed mostly for number crunching, and thus take aim at the current supercomputer market, working their way up from the lower end. Other solutions (like Google's or Amazon's) are about running Web applications and thus target a different market.

2) But the big difference between the cloud computing of the future and today's supercomputers is that the former will still rely heavily on off-the-shelf components. There is indeed no incentive to drop mass-produced CPUs and GPUs, for they enjoy huge economies of scale. With over 10 million units sold, the processors inside a PS3 enjoy economies of scale that a Cray or a BlueGene could only dream of. An individual purchasing an Nvidia graphics card enjoys the same technology as the company that purchases a Tesla supercomputer.

This also means a different segmentation of the market - a vertical segmentation, to be precise. The first tier is occupied by processor manufacturers who try to produce the fastest and cheapest processors (Intel, Nvidia). The second tier is occupied by manufacturers that build computers (or at least boards) integrating as many of those processors as possible (still Intel and Nvidia - at least right now). And the last tier is occupied by integrators who link all these machines together into a packaged computer, possibly customizing them a little (Google, Amazon). It remains to be seen which tier will provide the most value - and thus reap most of this industry's profit.

3) One can however expect grid computing to stay alive because of its very low cost. Who wouldn't like a record-breaking computing capability powered mostly by other people's computers and video game consoles? There is however a limitation: only so many people will donate computing time, and recruiting them depends heavily on word of mouth.

