Issue #44
|||||||||||||||||||||||||||||
Last Update: March 2, 2006
|||||||||||||||||||||||||||||
Technology

Google from the Inside
by David Katz

At the USENIX technical conference held in Boston this month, Rob Pike, the veteran Bell Labs researcher now at Google, provided a glimpse of what goes on behind the scenes and of the philosophies that have made Google successful.

Google essentially consists of three components: a crawler, an indexer and a query handler. The crawler is software that works its way through the internet and inspects every single web page, a monumental task when you consider that there are over four billion web documents. The crawler visits each site, notes changes since its last visit or identifies pages and documents (PDF, Word and other formats) as new, and creates an abstract that can be fed to the indexer. The indexer does what its name implies: it builds the indexes used to maintain the system and to answer user queries. The indexing algorithm has evolved over the years to be extremely sophisticated and very, very fast. The query handler is the software that faces the user performing a Google search. Keywords submitted by the user are parsed and used to search the indexes, and results are presented in order of how many keywords a site matches, then by the popularity of the site. (A toy sketch of this three-stage pipeline appears at the end of this article.) Unlike many of its competitors, Google does not put sites that pay for position at the top of the list. Paid references appear off to the side of the actual results, so that users know these sites have paid for special handling.

To keep responses fast, each query is broken into small pieces that many computers can work on simultaneously; more than 1,000 computers may be involved in handling a single query (the second sketch below illustrates this scatter-and-gather approach).

Early in Google's development, the company's founders made the key decision to use cheap, commodity hardware and to put the bulk of the company's resources into software development. Failure will occur in any system, no matter how expensive the hardware. If the software is sufficiently robust, the company can afford to buy lots of cheap hardware instead of a few very expensive pieces, and will end up with a more robust system overall. Reliability is achieved through replication, load spreading and a clever task-routing scheme that notices when a piece of hardware has failed and hands its work to a functioning computer (the third sketch below). For this philosophy to succeed, data and other significant information must be duplicated many times across the system, so that when the computer normally responsible for a piece of information fails, others can take over immediately. Components may fail, but the system never does.

Breaking resources into pieces achieves not only fault tolerance but also scalability: the ability to grow the system as demand grows. Inexpensive, off-the-shelf hardware (much of Google's hardware is not very different from what sits on an office desk or in the home) and first-rate software make for an extremely flexible and practical system that has kept Google at the forefront of search companies. Because it is geographically well distributed, Google can also cope with power failures, communications failures, viruses and attacks of various kinds. It serves millions of users with a robustness and economy that banks, exchanges, retail organizations and the government can only envy.
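As a thought experiment, the crawl-index-query pipeline described above can be caricatured in a few lines of Python. Everything here is invented for illustration: the URLs, the page text and the simple match-count ranking. The real crawler, indexer and query handler run across thousands of machines and are vastly more sophisticated.

    # Toy crawl -> index -> query pipeline. All names and data are
    # illustrative; this is a sketch, not Google's implementation.
    from collections import defaultdict

    # "Crawler": pretend these pages were fetched and reduced to abstracts.
    pages = {
        "http://example.com/a": "cheap commodity hardware and robust software",
        "http://example.com/b": "robust software tolerates hardware failure",
        "http://example.com/c": "queries are split across many machines",
    }

    # "Indexer": build an inverted index mapping each word to its pages.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    # "Query handler": rank pages by how many keywords they match
    # (popularity would be the tiebreaker in the real system).
    def query(keywords):
        hits = defaultdict(int)
        for word in keywords.lower().split():
            for url in index.get(word, ()):
                hits[url] += 1
        return sorted(hits, key=hits.get, reverse=True)

    print(query("robust hardware"))  # pages a and b match; page c does not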
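The "many computers per query" idea is essentially scatter and gather: partition the index into shards, search the shards in parallel, and merge the partial results. In this second sketch the shard contents are invented, and a thread pool stands in for a fleet of machines.

    # Scatter/gather sketch: shards searched in parallel, results merged.
    from concurrent.futures import ThreadPoolExecutor

    shards = [  # each shard holds the index entries for part of the web
        {"robust": ["http://example.com/a"]},
        {"robust": ["http://example.com/b"], "hardware": ["http://example.com/b"]},
        {"hardware": ["http://example.com/a"]},
    ]

    def search_shard(shard, word):
        return shard.get(word, [])

    def scatter_gather(word):
        # Scatter: one lookup per shard, run concurrently.
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            partials = pool.map(lambda s: search_shard(s, word), shards)
        # Gather: merge the per-shard hit lists into one result set.
        return sorted({url for hits in partials for url in hits})

    print(scatter_gather("robust"))  # both pages containing "robust"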
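Replication-based failover can likewise be reduced to a toy. In this third sketch, every piece of data has several replicas, and the router simply skips machines that fail a health check; the machine names and the health test are invented for illustration.

    # Failover routing sketch: send each task to the first healthy replica.
    REPLICAS = {"index-shard-7": ["machine-a", "machine-b", "machine-c"]}
    DOWN = {"machine-a"}  # pretend the primary just failed

    def healthy(machine):
        return machine not in DOWN

    def route(task):
        for machine in REPLICAS[task]:  # primary first, then backups
            if healthy(machine):
                return machine
        raise RuntimeError("all replicas down")  # replication makes this rare

    print(route("index-shard-7"))  # machine-b takes over for machine-a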
|||||||||||||||||||||||||||||
New York Stringer is published by NYStringer.com. For all communications, contact David Katz, Editor and Publisher, at david@nystringer.com. All content copyright 2005 by nystringer.com.
|||||||||||||||||||||||||||||