For any website administrator it is important to regulate how search engine robots crawl your site before they suck the life out of it. While .htaccess is the right tool for password protection and stopping image leeching, simply adding a robots.txt file can take a lot of load off your site.
This is especially important for administrators who use Amazon Webstores and for anyone who feeds their content off remote XML data. If your site can generate an unlimited number of pages from an external data source, you should look at more in-depth ways to restrict bot access, as in the sketch below.
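One server-level option is to refuse bot requests to the dynamically generated part of the site entirely. This is a minimal .htaccess sketch, assuming Apache with mod_rewrite enabled; the /products/ path and the user-agent pattern are placeholders you would adapt to your own setup.

# Refuse requests from common crawlers to a dynamically generated directory
# (hypothetical /products/ path; adjust the patterns to your own structure and logs)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|slurp) [NC]
RewriteRule ^products/ - [F,L]

Because this is enforced by the web server, it works even on bots that never read robots.txt.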
For WordPress users, you want to block access to files and directories that are not part of your public content, such as the admin and plugin directories. You may also want to restrict access to your images and cache files.
A basic WordPress robots.txt file could include the following:
Notice that User-agent: * applies to all bots and that the disallows come at the top, followed by an override for Google AdSense. A bad bot is also banned, identified by the user-agent string found in the web server logs.
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# Google AdSense
User-agent: Mediapartners-Google
Disallow:
Allow: /*
# BadBot found in my logs
User-agent: badbot
Disallow: /
Remember that a robots.txt file is only as good as the bot that reads it.
If someone is using wget to grab your site, or a bot designed to harvest emails and links, you are better off stopping it with script-level denial, .htaccess rules, or firewall commands.
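As a rough sketch of the .htaccess approach, again assuming Apache with mod_rewrite, you can deny requests outright based on the user-agent strings you see in your logs. Wget is a real client name; EmailSiphon and badbot stand in for whatever actually shows up in your own logs.

# Deny requests from download tools and harvesters by user-agent
# (patterns are examples; match them against your own server logs)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule .* - [F,L]

Unlike robots.txt, this is enforced by the server, so the bot never gets a choice about honoring it.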
If your site is getting hammered, contact your service provider for help.
This info is just for reducing unnecessary load, so that well-behaved bots are not running around in circles crawling pages they have no business looking at.