What is robots.txt?


We all want search engines to crawl and index our websites regularly. But there are certain pages or pieces of information we would rather hide from the search engine bots, for security or personal reasons, and so do not want indexed. We therefore need a way to tell the spiders not to crawl the pages or information we want to hide from the rest of the world. Fortunately, there is a way to do so: the Robots Exclusion Protocol.

What is ROBOTS.txt

Definition: Robots.txt is a simple text file that controls which web pages or file types search engine bots/spiders may crawl.

If there are particular files and folders that you do not want any search engine to index, you can use robots.txt. Compliance is not mandatory, but most search bots check robots.txt first and follow the instructions in it. Having a robots file is also an SEO best practice (we will discuss this later). It should not, however, be considered a replacement for other security measures such as password protection.


Placement of the robots file is very important. It must be in the root directory (example: http://www.abcd.com/robots.txt), or search bots will not find it.
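Because the file must sit at the site root, the robots.txt URL for any page can be derived from the page's scheme and host alone. A minimal Python sketch (the domain and path are just illustrations):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt always lives at the site root, regardless of the page path
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.abcd.com/photos/car/mercedes.jpg"))
# http://www.abcd.com/robots.txt
```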


The structure of the robots file is pretty simple, but you need to watch out for misspelled user-agents, wrong directory paths, and contradicting statements.

User-agent: *
Disallow:
# Allow all robots to visit the website


User-agent: *
Disallow: /
# Block all robots from visiting the website


User-agent: msnbot
Disallow: /
# Block only msnbot from the website


User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
# Block all robots from tmp and logs directories


User-agent: *
Disallow: /photos
Allow: /photos/mercedes.jpg
# This tells Googlebot that it can visit "mercedes.jpg" in the photos folder, even though the "photos" folder is otherwise excluded

Note: Directories and filenames are case-sensitive: “mycar”, “MYCAR”, and “Mycar” are all different to search engines.
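You can see this case sensitivity with Python's standard-library robots.txt parser (a small sketch; the domain and rule are illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /mycar",
])

# The rule matches /mycar only with this exact casing
print(rp.can_fetch("*", "http://www.abcd.com/mycar/photo.jpg"))  # False
print(rp.can_fetch("*", "http://www.abcd.com/MYCAR/photo.jpg"))  # True
```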


User-agent: *
Crawl-delay: 10
# The Crawl-delay parameter sets the number of seconds a crawler should wait between successive requests to the same server

Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
# Many crawlers also support a Sitemap directive; multiple sitemaps may be listed in the same robots.txt in this form
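Python's `urllib.robotparser` (3.8+) can read these Sitemap lines back out; a quick sketch using the two lines above:

```python
import urllib.robotparser

robots_txt = """\
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# site_maps() returns the listed sitemap URLs (or None if there are none)
print(rp.site_maps())
```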

Robots.txt syntax

User-agent: – identifies the web crawler the rules apply to (e.g. "Googlebot", "msnbot"); * matches all crawlers
Disallow: / – blocks the whole website for the crawler
Disallow: /photos – blocks that particular folder, while everything else remains allowed
Disallow: – disallows nothing (everything may be crawled)
Allow: – allows all the pages and files on the web server
Crawl-delay: 10 – sets the number of seconds to wait between two successive crawl requests
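These directives can be checked programmatically with Python's built-in parser before a crawler fetches a page. A sketch under illustrative names (note that Python evaluates Allow/Disallow rules in file order, so the more specific Allow line is placed first here):

```python
import urllib.robotparser

robots_txt = """\
User-agent: *
Allow: /photos/mercedes.jpg
Disallow: /photos
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# /photos is blocked, but the explicitly allowed file is not
print(rp.can_fetch("Googlebot", "http://www.abcd.com/photos/holiday.jpg"))   # False
print(rp.can_fetch("Googlebot", "http://www.abcd.com/photos/mercedes.jpg"))  # True
print(rp.crawl_delay("Googlebot"))                                           # 10
```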


The Robots Exclusion Protocol (REP) is purely advisory. Even though the term "Disallow" is used, compliance relies entirely on the web crawlers: robots.txt does not guarantee that the disallowed files and folders are excluded for all user agents. Many malicious bots ignore the rules and deliberately visit the disallowed links.

Bottom line

No doubt robots.txt is a convenient way to block specific pages from search engine spiders, and effective use of robots.txt is also essential for SEO. But you need to write it carefully, as a faulty robots.txt file can hurt your website.