Robots.txt
07 October, 2008
What Is Robots.txt?
Robots.txt is a text file used to tell search engine bots which pages you would like them not to visit. Obeying robots.txt is not mandatory for search engines, but reputable ones generally follow the instructions it gives. It is important to clarify that robots.txt is not a way to keep search engines from crawling your site. Using robots.txt is more like putting a “Please do not enter” note on an unlocked door.
Where you put robots.txt on your site is very important. It must be in the main (root) directory; otherwise user agents cannot find it, because they do not search the whole website for it. Instead they look only in the main directory (i.e. http://domainname.com/robots.txt), and if they don't find the file there, they simply assume the site does not have one and index everything they find along the way. So if search engines index your whole site, don't be surprised: the likely reason is that robots.txt is not in the right place.
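To illustrate where crawlers look, here is a small Python sketch (using only the standard library) of how a crawler derives the robots.txt location from any page URL; the domain is just a placeholder:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers look for robots.txt only at the root of the host,
    # never in subdirectories.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://domainname.com/blog/post.html"))
# http://domainname.com/robots.txt
```

No matter how deep the page is, the crawler requests robots.txt from the root of the host only.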
Structure of a Robots.txt File
The structure of a robots.txt file is quite simple and flexible – it is just a list, of any length, of user agents and disallowed files and directories. Basically, the syntax is as follows:
User-agent:
Disallow:
“User-agent” specifies which search engine crawler the rules apply to, and
“Disallow” lists the files and directories to be excluded from indexing.
In addition to “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
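As a quick way to check how such a file is interpreted, Python's standard-library urllib.robotparser can parse these rules; the bot name and example.com URLs below are placeholders:

```python
from urllib import robotparser

ROBOTS_TXT = """\
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Anything under /temp/ is blocked for every agent...
print(rp.can_fetch("AnyBot", "http://example.com/temp/page.html"))  # False
# ...while the rest of the site stays open.
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))      # True
```

This is also a handy way to validate a robots.txt file before uploading it.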
The Traps of a Robots.txt File
Problems can start when you make more complicated files – i.e. when you decide to allow different user agents access to different directories – if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradictory directives. Typos are misspelled user agents or directories, missing colons after User-agent and Disallow, etc. Typos can be tricky to find, but in many cases validation tools help.
The more serious problems are caused by logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
The above example comes from a robots.txt file that is meant to allow all agents access to everything on the site except the /temp directory, with Googlebot additionally barred from /images/ and /cgi-bin/. But when Googlebot starts reading robots.txt from the top and stops at the first record that matches it, it will see that all user agents are allowed everywhere except /temp/. That match is enough, so it will not read the file to the end and will index everything except /temp/ – including /images/ and /cgi-bin/, which you thought you had told it not to touch. To avoid this trap, list the records for specific crawlers before the catch-all User-agent: * record.
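A minimal sketch of such a naive crawler makes the trap concrete. This is a hypothetical parser that stops at the first matching record, as described above; real crawlers differ, and spec-compliant parsers (e.g. Python's urllib.robotparser) instead prefer the record named after the agent over the * record:

```python
def first_matching_disallows(lines, agent):
    # Collect (user-agents, disallows) records from robots.txt lines.
    records = []
    agents, disallows = [], []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:  # a User-agent line after Disallow lines starts a new record
                records.append((agents, disallows))
                agents, disallows = [], []
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    # The trap: stop at the FIRST record that matches this agent,
    # whether it names the agent or is the wildcard "*".
    for rec_agents, rec_disallows in records:
        if agent.lower() in (a.lower() for a in rec_agents) or "*" in rec_agents:
            return rec_disallows
    return []

robots = """User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/""".splitlines()

print(first_matching_disallows(robots, "Googlebot"))  # ['/temp/']
```

The naive crawler never reaches the Googlebot record, so /images/ and /cgi-bin/ remain crawlable – exactly the behavior described above.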
