Tuesday, June 5, 2007

Google Developer Day - Session 2 - A Computing System for the World's Information: A Look Behind the Scenes at Google

The second true session of the day (not including the keynote) was called "A Computing System for the World's Information: A Look Behind the Scenes at Google." This session, presented by Jeff Dean, focused primarily on specific aspects of Google's infrastructure. To summarize their systems philosophy: build a large number of cheap machines that are expected to fail, but write extremely reliable software that handles the failures gracefully and milks the hardware as much as possible. By "cheap machines," I mean literally desktop-grade hardware, not complicated, expensive servers. Mr. Dean joked that ultra-reliable hardware makes programmers lazy. During the presentation, three key internally developed technologies were covered: GFS, MapReduce, and BigTable. I don't plan on going through all the notes I have on these (in fact, I don't understand them all myself), but here are some highlights.

GFS is Google's internally developed file system, specifically tuned to their unique requirements (did you really think that NTFS could store all the world's information?). Some of these requirements include huge read/write bandwidth and reliability across thousands of cluster nodes. By writing their filesystem from scratch, they can also ensure that it has the reliability necessary for the crappy hardware it runs on. The filesystem serves Google's 200+ clusters, some of which have up to 5000 machines.

MapReduce is a library of code, or a "system used for expressing a way of computation such that a programmer must write the computation in a certain style," designed for processing lots of data. The system defines two phases: a map phase and a reduce phase. The map phase extracts the relevant information from each record of the input, and the reduce phase collects that data together to produce the final output. This technology is useful for batch processes and helps solve problems in a standard way. Thousands of programs at Google use it. For more information, search Google for "MapReduce" or read this article.
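To make the two phases a bit more concrete, here is a tiny single-machine sketch of the map/reduce style, using word counting as the job. This is not Google's library or its API: the function names (`map_phase`, `reduce_phase`, `run_mapreduce`) and the sample records are made up for illustration, and the real system spreads this work across many machines and handles their failures.

```python
from collections import defaultdict


def map_phase(record):
    """Extract the relevant information from one input record:
    here, emit a (word, 1) pair for every word in a line of text."""
    for word in record.lower().split():
        yield word, 1


def reduce_phase(key, values):
    """Collect all values emitted for one key into a final result:
    here, the total count for a word."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): run the mapper over every record,
    group the emitted values by key, then run the reducer per key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Made-up input records, one line of text each.
records = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_mapreduce(records, map_phase, reduce_phase)
print(counts)
```

The point of the "certain style" is that the programmer only writes the two small functions; everything in the driver is the part a system like MapReduce takes over.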

BigTable is another internally developed technology that Mr. Dean described as a higher-level API than a filesystem, somewhat like a database, but not as full-featured. I understood this as a giant data structure to store and organize a ton of information. Google's crawlers use BigTable to help index the vast amount of information on the Internet. BigTable is also used by a lot of other data-heavy applications for their storage on GFS. Here's a paper Mr. Dean wrote on BigTable.
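My notes don't cover the data model, but the published BigTable paper describes it as a sparse, sorted map indexed by row key, column key, and timestamp. Below is a toy in-memory sketch of that idea only. The class name `ToyTable`, its methods, and the sample row keys are all hypothetical; the real system is distributed and has a different API.

```python
class ToyTable:
    """Toy, single-process stand-in for a (row, column, timestamp) -> value map.
    Hypothetical illustration only; not BigTable's actual interface."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest at or before `timestamp`."""
        versions = self._cells.get((row, column), [])
        for ts, value in reversed(versions):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan_rows(self, prefix):
        """Return row keys starting with `prefix`, in sorted order."""
        rows = {row for (row, _column) in self._cells}
        return sorted(row for row in rows if row.startswith(prefix))


# Made-up data: two versions of one crawled page, plus a second page.
table = ToyTable()
table.put("com.example/index", "contents", 1, "<html>v1</html>")
table.put("com.example/index", "contents", 2, "<html>v2</html>")
table.put("com.example/about", "contents", 1, "<html>about</html>")

latest = table.get("com.example/index", "contents")
older = table.get("com.example/index", "contents", timestamp=1)
example_rows = table.scan_rows("com.example/")
```

Keeping several timestamped versions per cell is what would let a crawler store successive fetches of the same page side by side.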
