Note: How Search Engines Work is the first in a planned series of articles on search engine optimization that will coincide with the release of my next book, Google Ranking Signals: An Official Guide to the World's Most Popular Search Engine.
Search engines have two major functions: crawling the World Wide Web and building an index of what they find. They then answer a user’s query by algorithmically calculating relevance and serving ranked results.
Imagine for a second that the World Wide Web is a vast public transportation network. Each stop along a route is a web page, image, PDF or other type of document. The tracks or routes that connect these stops are hyperlinks. Hyperlinks are a foundational concept of the Web and its most essential element. A hyperlink points to a whole document or to a specific element within a document.
When people browse a web document, they follow hyperlinks. Specially designed programs can also follow hyperlinks automatically. A program that traverses hyperlinks and gathers the information it retrieves is known as a web spider, bot or crawler. Search engines deploy these crawlers to discover and fetch the billions of interconnected web documents so that they can be indexed.
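To make the idea of following hyperlinks concrete, here is a minimal sketch of a crawler in Python. It is an illustration only, not how any real search engine is built. The tiny in-memory "web" (FAKE_WEB), the example.com URLs and the fetch helper are all made up for this example; a real crawler would request pages over HTTP, respect robots.txt and handle relative links.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. Stands in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">Home</a> '
                            '<a href="http://example.com/missing">Gone</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP request; returns None for unknown URLs."""
    return FAKE_WEB.get(url)


def crawl(seed):
    """Breadth-first traversal from seed; returns {url: html} for fetched pages."""
    fetched = {}
    queue = deque([seed])
    seen = {seed}  # every URL ever queued, so no page is fetched twice
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to store or scan
        fetched[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


pages = crawl("http://example.com/")
print(list(pages))
```

Starting from a single seed URL, the sketch reaches every page that is linked from it, directly or indirectly, and quietly skips the dead link.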
According to moz.com’s Beginner’s Guide to SEO (Search Engine Optimization):
Search engines are answer machines. When a person looks for something online, it requires the search engines to scour their corpus of billions of documents and do two things – first, return only those results that are relevant or useful to the searcher’s query, and second, rank those results in order of perceived usefulness. It is both “relevance” and “importance” that the process of SEO is meant to influence.
To a search engine, relevance means more than simply finding a page with the right words. In the early days of the web, search engines didn’t go much further than this simplistic step, and their results suffered as a consequence. Thus, through evolution, smart engineers at the engines devised better ways to find valuable results that searchers would appreciate and enjoy. Today, 100s of factors influence relevance...
In a 2006 article on the Google Librarian Center, Matt Cutts said this about crawling and indexing:
A lot of things have to happen before you see a web page containing your Google search results. Our first step is to crawl and index the billions of pages of the World Wide Web. This job is performed by Googlebot, our “spider,” which connects to web servers around the world to fetch documents. The crawling program doesn’t really roam the web; it instead asks a web server to return a specified web page, then scans that web page for hyperlinks, which provide new documents that are fetched the same way. Our spider gives each retrieved page a number so it can refer to the pages it fetched.
Our crawl produces an enormous set of documents, but these documents aren’t searchable yet. Without an index, if you wanted to find a term like civil war, our servers would have to read the complete text of every document every time you searched.
So the next step is to build an index. To do this, we “invert” the crawl data; instead of having to scan for each word in every document, we juggle our data in order to list every document that contains a certain word. For example, the word “civil” might occur in documents 3, 8, 22, 56, 68, and 92, while the word “war” might occur in documents 2, 8, 15, 22, 68, and 77. Once we’ve built our index, we’re ready to rank documents and determine how relevant they are. Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:
- Find the set of pages that contain the user’s query somewhere
- Rank the matching pages in order of relevance
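The inverted index and the two query steps Cutts describes can be sketched in a few lines of Python. This is a toy under stated assumptions: the four sample documents (DOCS) are invented, and the scoring rule (counting how often the query words appear) is a placeholder chosen for simplicity, not Google's ranking method.

```python
from collections import defaultdict

# Hypothetical crawled documents, keyed by the number assigned during the crawl.
DOCS = {
    1: "the civil war began in 1861",
    2: "civil engineering builds bridges",
    3: "war and peace is a long novel",
    4: "the war ended and civil war history was written",
}


def build_index(docs):
    """Invert the crawl data: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, query):
    """Return matching document numbers, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Step 1: find the documents that contain every query word.
    matches = set.intersection(*(index.get(term, set()) for term in terms))

    # Step 2: rank the matches. Toy score: total occurrences of the query words.
    def score(doc_id):
        words = docs[doc_id].lower().split()
        return sum(words.count(term) for term in terms)

    return sorted(matches, key=lambda doc_id: (-score(doc_id), doc_id))


index = build_index(DOCS)
results = search(index, DOCS, "civil war")
print(results)

# The posting lists from the quote above: intersecting them gives the
# documents that contain both words.
civil = {3, 8, 22, 56, 68, 92}
war = {2, 8, 15, 22, 68, 77}
both = sorted(civil & war)
print(both)  # [8, 22, 68]
```

Because the index already lists which documents contain each word, answering the query only requires intersecting two small sets rather than rereading every document.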
Matt Cutts has been the head of the Google Webspam team for over 15 years and is one of Google’s primary public-facing figures. Cutts can’t reveal how Google uses its 200+ signals to determine search quality and rank, because spammers would take advantage of that information, but he regularly warns that certain behaviors, such as guest blogging, might be penalized by Google. Cutts recently announced that he is taking a long break from his job to spend more time with his wife and family.