How Search Engines REALLY Work – Most Webmasters Have No Clue…

Tuesday, September 15th, 2009

This is a bit of a technical post, but you don’t have to know the nitty-gritty of how search engines work, just some of the basics that you may not know about.

The first job of the search engines is to crawl the web. This means sending out bots to spider all of the pages, index those pages and analyze the links that they find along the way.

On the user side, they have to deliver search results, so they have to handle queries, retrieve pages based on those queries, rank the pages they have retrieved and then display the SERPs (search engine results pages). All of this happens very fast, in a matter of milliseconds.

The first thing we want to talk about is URL or Page Discovery, which means how the search engines actually go about finding pages on the web. URL actually stands for Uniform Resource Locator, but there’s no need to go into that much detail; for us it is simply the address of a page on the web.

There are several ways that a search engine can find your pages. For example, you can submit an XML sitemap to all of the major search engines; the XML format is standard across all of the SE’s. You can also submit your site to site directories, though that is not as useful as it was even a few years ago, and some SE’s even offer paid inclusion.

By far the most common way that a search engine will find your URLs is via links on other “indexed” pages that point to your pages. This is how the search engines prefer to find your pages and it will provide the greatest weight to your pages.
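To picture link-based discovery, here is a minimal sketch in Python. It assumes a made-up link graph (`LINKS`) and a hypothetical `discover` function; real crawlers are far more involved, but the idea of starting from known pages and following their links is the same.

```python
from collections import deque

# Hypothetical link graph: each page maps to the URLs it links to.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
    # No page links to this one, so following links never reaches it.
    "http://orphan.example/": ["http://a.example/"],
}

def discover(seed_urls, links):
    """Return URLs in the order a breadth-first crawl would find them."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

discovered = discover(["http://a.example/"], LINKS)
print(discovered)
```

Notice that the orphan page is never found, which is exactly why links from already indexed pages matter so much.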

An XML sitemap is a great way to tell the SE’s about your pages, but what many webmasters don’t realize is that it is not a mandatory call to action for the spiders. It is still up to the SE’s to analyze these pages and determine whether or not they will be spidered and added to the index.
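If you have never looked inside a sitemap, the sketch below builds a tiny one with Python’s standard library. The `build_sitemap` helper and the example URLs are made up for illustration; the `urlset`, `url`, `loc` and `lastmod` elements and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a sitemap string from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Example pages (hypothetical).
sitemap_xml = build_sitemap([
    ("http://www.yourdomain.com/", "2009-09-15"),
    ("http://www.yourdomain.com/yourpage.html", None),
])
print(sitemap_xml)
```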

So what exactly is a spider and how does it work?

A spider or bot or crawler is a software program designed by the SE’s to simply crawl the web and collect data or as they call it “fetch pages”.

The spiders are expected to follow certain rules that you can set forth for them regarding how you want them to spider your site. These are contained in a file called robots.txt and they follow the Robots Exclusion Protocol. There are many tutorials online that talk about the robots.txt options in depth.
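As a rough illustration, Python ships a robots.txt parser in its standard library, so you can check how a well-behaved spider would read your rules. The robots.txt content and the URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone stays out of /private/, Slurp stays out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Slurp
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.yourdomain.com/yourpage.html"
private_page = "http://www.yourdomain.com/private/notes.html"

googlebot_page = parser.can_fetch("Googlebot", page)
googlebot_private = parser.can_fetch("Googlebot", private_page)
slurp_page = parser.can_fetch("Slurp", page)

print(googlebot_page, googlebot_private, slurp_page)
```

Keep in mind that this only tells you what a polite spider would do; nothing in the file can force a badly behaved bot to obey.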

Just because a spider has visited your site does not mean you will automatically be indexed. The spider simply collects the data and stores it for the engine to analyze later and determine whether it is to be indexed or not.

The spider communicates with your server when requesting to look at your pages. The protocol that it follows is called HTTP, or Hypertext Transfer Protocol.

So this is the basic interaction:

1. The spider sends a “GET” request to your server which basically says “GET /yourpage.html”. Included in this request is the “Host” header, which will say something like “Host: www.yourdomain.com”.

Many spiders will also send a header that looks like this: “If-Modified-Since”. This is a way to ask the server if the page has been modified since the last time they requested it. If it has not, they can skip it and move on.

The spider will also include information telling the server who it is, for example “Googlebot” from Google or “Slurp” from Yahoo.

2. The server will respond to the spider telling it the status of the page requested. There are several different status codes that can be returned, for example:

  • 200 OK
  • 301 Moved Permanently
  • 302 Found (moved temporarily)
  • 304 Not Modified (since the last visit)
  • 404 Not Found
  • 410 Gone (for good)
  • 500 Internal Server Error

3. Assuming the spider finally reaches your page, it will simply store all of the HTML code in the spider database.

At this point the spider’s job is done, for this particular page anyway.
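The three steps above can be sketched in a few lines of Python. Nothing here touches the network: `build_request` and `next_step` are hypothetical helpers that only show what the request text looks like and how a spider might react to each status code. The reactions are my own simplification, not any engine’s documented behavior.

```python
from http import HTTPStatus

def build_request(path, host, user_agent, if_modified_since=None):
    """Return the raw text of a GET request like the one a spider sends."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
    ]
    if if_modified_since:
        lines.append(f"If-Modified-Since: {if_modified_since}")
    # A blank line marks the end of the request headers.
    return "\r\n".join(lines) + "\r\n\r\n"

def next_step(status_code):
    """Map a response status code to a simplified spider reaction."""
    status = HTTPStatus(status_code)
    if status == HTTPStatus.OK:
        return "store the HTML"
    if status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND):
        return "follow the redirect"
    if status == HTTPStatus.NOT_MODIFIED:
        return "keep the stored copy"
    return "skip the page"  # 404, 410, 500 and anything else

request = build_request(
    "/yourpage.html",
    "www.yourdomain.com",
    "Googlebot",
    if_modified_since="Tue, 15 Sep 2009 00:00:00 GMT",
)
print(request)
for code in (200, 301, 304, 404):
    print(code, HTTPStatus(code).phrase, "->", next_step(code))
```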

Parsing and indexing of your pages actually happen later on in the search engine software itself; they are not done by the spider.

Parsing means that the SE software will now process the page and prepare it for indexing and ranking. This means stripping out things like JavaScript, most of the formatting tags and unneeded HTML such as iframes and whatever content is within them.

What the engine will leave in and pay attention to are the tags it finds important, such as title tags, meta keywords and description tags, header tags (H1-H6), A (anchor) tags and IMG (image) tags.
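Here is a small sketch of that kind of parsing using Python’s built-in `html.parser`. The `PageParser` class, the tag sets and the sample page are all assumptions made for illustration; no search engine publishes its actual parser.

```python
from html.parser import HTMLParser

KEPT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "a"}
SKIPPED_TAGS = {"script", "style", "iframe"}

class PageParser(HTMLParser):
    """Collect (tag, text) pairs and ignore script, style and iframe content."""

    def __init__(self):
        super().__init__()
        self.fields = []       # collected (tag, text) pairs
        self._stack = []       # currently open tags from KEPT_TAGS
        self._skip_depth = 0   # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag in KEPT_TAGS:
            self._stack.append(tag)
        elif tag == "meta" and name in ("keywords", "description"):
            self.fields.append(("meta:" + name, attrs.get("content") or ""))
        elif tag == "img" and attrs.get("alt"):
            self.fields.append(("img", attrs["alt"]))

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif not self._skip_depth and self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.fields.append((self._stack[-1] if self._stack else "body", text))

# A made-up page to parse.
SAMPLE = """<html><head><title>Link Automation Tips</title>
<meta name="description" content="How spiders fetch pages">
<script>var x = "ignored";</script></head>
<body><h1>Automation</h1><p>Plain text.</p>
<a href="/more.html">read more</a>
<img src="bot.png" alt="a spider">
<iframe src="ad.html"><p>framed text</p></iframe></body></html>"""

page_parser = PageParser()
page_parser.feed(SAMPLE)
fields = page_parser.fields
for field in fields:
    print(field)
```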

Now here is where it gets interesting…

It’s time for the SE to store your “page” in its index, at least that’s what we commonly think happens. But in reality, the engine stores the words in the index and uses them to reference our pages. Each word is stored in the index and then given attributes as they relate to our page. For example, if the word “automation” is found on our page, it will be stored in the index along with the URL it was found on; where it was found on the page (in the title tag, the description, etc.); how many times it appears and in which position (for example, 2nd word in the title or 54th word in the content); and the HTML element the word appeared in (for example, A anchor text or an H2 header tag). Of course, these would all be represented by some type of numeric code to make it easier and faster for the index to find and sort later.

That was a pretty complex paragraph! The important thing to understand here is that the index does not store our pages, it stores our words as they relate to our pages.
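If it helps, this kind of structure is usually called an inverted index. Below is a toy version in Python, assuming a hypothetical `index_page` helper and plain dictionaries instead of the compact numeric codes a real engine would use.

```python
from collections import defaultdict

def index_page(index, url, fields):
    """Record every word with the URL, the field it was in and its position."""
    for field, text in fields:
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append({"url": url, "field": field, "position": position})

index = defaultdict(list)
index_page(index, "http://www.yourdomain.com/yourpage.html", [
    ("title", "Link automation tips"),
    ("h2", "Why automation helps"),
])

# The index is keyed by word, not by page.
for posting in index["automation"]:
    print(posting)
```

Looking up “automation” gives back two entries, one for the 2nd word in the title and one for the 2nd word in the H2, both pointing at the same URL.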

This gets confusing sometimes because Google will present us with a full cache of our pages, but this is simply something that Google offers to its users and it is NOT what is being queried when we perform searches.

Links and anchor text are stored in separate indexes altogether. First the engine will filter out all of the “nofollow” links, the duplicate links and any links that return an error such as a 404, 410 or 500 response. Once it determines the links it wants to keep track of, it will place the link in a special index and then place the anchor text for that link in a different index.
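A rough sketch of that filtering step might look like this. The `index_links` function and the shape of the link records are assumptions for the example; the filtering rules are the ones described above.

```python
ERROR_STATUSES = (404, 410, 500)

def index_links(links):
    """Filter links, then store links and anchor text in separate indexes."""
    link_index = []     # (source, target) pairs worth tracking
    anchor_index = {}   # target URL -> list of anchor texts
    seen = set()
    for link in links:
        if "nofollow" in link.get("rel", "").split():
            continue    # nofollow links are filtered out
        if link["status"] in ERROR_STATUSES:
            continue    # the target returned an error
        key = (link["source"], link["target"])
        if key in seen:
            continue    # duplicate link
        seen.add(key)
        link_index.append(key)
        anchor_index.setdefault(link["target"], []).append(link["anchor"])
    return link_index, anchor_index

# Made-up links found on one page.
FOUND_LINKS = [
    {"source": "/a.html", "target": "/b.html", "anchor": "link automation", "rel": "", "status": 200},
    {"source": "/a.html", "target": "/b.html", "anchor": "link automation", "rel": "", "status": 200},
    {"source": "/a.html", "target": "/c.html", "anchor": "sponsor", "rel": "nofollow", "status": 200},
    {"source": "/a.html", "target": "/gone.html", "anchor": "old page", "rel": "", "status": 404},
]

link_index, anchor_index = index_links(FOUND_LINKS)
print(link_index)
print(anchor_index)
```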

When searching for backlinks on the internet, it’s very important to keep in mind that Google does not share this information freely, or at all for that matter. When it comes to reporting backlinks, Google is very inaccurate. That does not mean they do not know; they do. But sharing that information accurately with webmasters would make it very easy for them to figure out Google’s algorithms and manipulate their rankings, and that is not something that Google wants happening.

With links in mind, it is important to know the value placed on errors returned. Internal links that point to pages on your site which return an error raise a red flag and are considered very bad. Links pointing to outside pages on the web are less of a problem, since you really cannot be held accountable for them, as long as they are limited in number.

Here are some popular misconceptions that should be cleared up now :) :

  • Search engines don’t store web pages, they store words
  • Search engines DO NOT search the web when you use their search tool, they only search “some” of what is in their index
  • If you are not in the index, you will not appear in search results
  • Search engines do not rank pages ahead of time; first they receive a query from a user, then they find matching pages in the index based on the text used and rank each page’s relevance to the search query
  • This means that the same page can match many different keywords and keyword phrases and will rank differently on each of those SERPs.
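To see why the same page can rank differently from one query to the next, here is a deliberately simple sketch. The `INDEX` data and the `search` function are invented, and the scoring (just adding up word counts) is a stand-in for illustration, not how any real engine scores relevance.

```python
# Toy index: word -> {page: number of times the word appears on it}.
INDEX = {
    "link": {"/page-a.html": 3, "/page-b.html": 1},
    "automation": {"/page-a.html": 1, "/page-b.html": 4},
}

def search(query, index):
    """Rank pages for one query by summing counts of the query words."""
    scores = {}
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] = scores.get(url, 0) + count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

link_results = search("link", INDEX)
automation_results = search("automation", INDEX)
missing_results = search("spider", INDEX)

print(link_results)
print(automation_results)
print(missing_results)
```

The ranking only exists once a query arrives: page A comes first for “link”, page B comes first for “automation”, and a word that is not in the index returns nothing at all.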

I hope this basic explanation has helped you better understand how SE’s actually do what they do!

Feel free to link to this page and send it to your friends as well.

©SEOLinkAutomation 2010.

