I’m breaking this post into two. My aim is to clear up some of the uncertainty you may have about search engines, and to do so I’d like to keep things simple. After all, if you want to benefit from being listed on search engines, you'd better know how they work, explained in the simplest manner possible.
Think of the Number Three
Crawler-based search engines are made up of three major elements: the spider, the index, and the software. Each has its own function and together they produce what we have come to trust (or distrust) on the SERPs (Search Engine Results Pages).
The Crawling Spider
Also known as a web crawler or robot, a search engine spider is an automated program that reads web pages and follows any links, preferably text-based ones, to other pages within the site. This is often referred to as a site being "spidered" or "crawled". There are three very active spiders on the Net: Googlebot (Google), Slurp (Yahoo!) and MSNBot (MSN Search).
Spiders start their journeys with a list of page URLs that have previously been added to their index (database). As the spider visits these pages, crawling the code and content, it adds new pages (links) that it finds on the page to its index. As such, one could refer to a spider as feeding an evolving index, which is discussed below.
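To make the loop above concrete, here is a minimal sketch in Python. It is not how any real search engine's spider is written. The tiny in-memory FAKE_WEB, the example.com URLs, and the names LinkExtractor and crawl are all made up for illustration, and a real spider would fetch pages over HTTP instead of reading them from a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the web: URL -> HTML. Invented for this example.
FAKE_WEB = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/blog/">Blog</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/blog/": '<a href="/blog/post1.html">Post</a> <a href="/about.html">About</a>',
    "http://example.com/blog/post1.html": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Visit the seed pages, then every page reachable from them by links."""
    index = {}                    # URL -> copy of the page
    frontier = deque(seed_urls)   # pages still waiting to be visited
    seen = set(seed_urls)         # avoids queueing the same page twice
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # page could not be fetched, so skip it
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


index = crawl(["http://example.com/"], FAKE_WEB.get)
for url in index:
    print(url)
```

Starting from a single known URL, the spider ends up with all four pages, because each visited page contributes its links to the list of pages still to visit.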
Search engine spiders return to the sites in their index on a regular basis, scanning for any changes. How often a spider returns is up to the search engine to decide. Website owners do have some control over a spider's visits, though, by making use of a robots.txt file. Spiders look for this file first, before crawling a site any further. So if, for instance, you didn’t want a page on your site to be crawled and listed on the search engines, you would edit the robots.txt file.
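As a rough illustration of how a well-behaved spider consults that file, the sketch below uses Python's standard urllib.robotparser module. The file contents, the /private/ folder, and the example.com URLs are invented for the example, and the rules are parsed from a string here, whereas a real spider would download the file from the root of the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: ask all spiders to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider asks before fetching each URL.
allowed_home = parser.can_fetch("Googlebot", "http://example.com/index.html")
allowed_private = parser.can_fetch("Googlebot", "http://example.com/private/notes.html")

print("index.html allowed:", allowed_home)
print("private/notes.html allowed:", allowed_private)
```

Keep in mind that the file is a request rather than a lock: it only works for spiders that choose to honour it.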
The Growing Index
An index is like a giant catalogue or inventory of websites containing a copy of every web page and file that the spider finds. If a web page changes, this catalogue is updated with the new information. To give you an idea of the size of these indexes, the latest figure released by Google is over 8 billion pages.
It sometimes takes a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed." Until a spidered page is indexed, that is, added to the index, it will not be available to those searching with the search engine.
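The gap between "spidered" and "indexed" can be pictured with a toy model like the one below. It is only a sketch under simple assumptions: the SearchEngine class, its pending holding area, and the update_index step are hypothetical names, and real search engines do not publish how their index updates actually work.

```python
class SearchEngine:
    """Toy model: pages are spidered first and become searchable later."""

    def __init__(self):
        self.pending = {}   # spidered, but not yet indexed
        self.index = {}     # indexed, so visible to searchers

    def spider(self, url, text):
        # The spider has read the page, but searchers cannot see it yet.
        self.pending[url] = text

    def update_index(self):
        # Move everything the spider has found into the searchable index.
        self.index.update(self.pending)
        self.pending.clear()

    def search(self, word):
        # Only the index is consulted, never the pending pages.
        return sorted(
            url for url, text in self.index.items()
            if word.lower() in text.lower().split()
        )


engine = SearchEngine()
engine.spider("http://example.com/new.html", "fresh spider facts")

before = engine.search("spider")   # page is spidered but not indexed
engine.update_index()
after = engine.search("spider")    # page is now indexed

print(before)
print(after)
```

The first search comes back empty even though the spider has already read the page. Only after the index is updated does the page show up in the results.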