htaccess and robots.txt

(Submitted Mon, 2006-09-18 04:10)

robots

The robots.txt file tells various spidering engines, like those used by search engines, what content to index and what content to leave alone.

You don't strictly need a robots.txt file in your root Drupal directory if you are running a public site. Without one, however, your admin log will start filling up with "robots.txt not found" warnings.

A quick solution is to create an empty robots.txt file. Search engine spiders will find the file, will not encounter any disallow rules, and - hopefully! - go about their business of indexing your website.
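The claim that an empty file blocks nothing can be checked with a minimal sketch using Python's standard-library robots.txt parser. The user agent name "ExampleBot" and the two paths are placeholders chosen for illustration, and real spiders may not behave exactly like this parser.

```python
from urllib.robotparser import RobotFileParser

# An empty robots.txt contains no lines at all.
parser = RobotFileParser()
parser.parse([])

# With no Disallow rules present, the parser permits every path.
# "ExampleBot" is a made-up user agent used only for this illustration.
empty_allows_admin = parser.can_fetch("ExampleBot", "/admin/")
empty_allows_node = parser.can_fetch("ExampleBot", "/node/1")

print(empty_allows_admin, empty_allows_node)
```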

Yet there is an even better approach. Why not actually list the directories that you don't want the spider indexing or wasting its time on? Think of the printer-friendly versions of book pages, for example: duplicate content. Duplicate content bad (just ask Google).

Here's a sample robots.txt that will help keep the spiders where you want them ... in your main content. This one courtesy of twohills.

User-agent: *
Crawl-Delay: 10
Disallow: /aggregator/
Disallow: /tracker/
Disallow: /comment/reply/
Disallow: /node/add/
Disallow: /taxonomy/
Disallow: /user/
Disallow: /files/
Disallow: /search/
Disallow: /book/print/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
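Before deploying rules like these, it can help to check which URLs they would actually block. The following is a rough sketch using Python's standard-library parser on an excerpt of the rules above. The user agent "ExampleBot" and the candidate paths are assumptions made for illustration, and individual search engines may interpret rules somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the sample rules above (not the full list).
SAMPLE_RULES = """\
User-agent: *
Crawl-Delay: 10
Disallow: /comment/reply/
Disallow: /user/
Disallow: /book/print/
Disallow: /admin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_paths(parser, paths, agent="ExampleBot"):
    """Return the paths the given (hypothetical) agent may not fetch."""
    return [p for p in paths if not parser.can_fetch(agent, p)]


parser = build_parser(SAMPLE_RULES)

# Hypothetical paths for illustration only.
candidates = ["/node/1", "/book/print/12", "/admin/settings", "/user/login", "/admin"]
blocked = blocked_paths(parser, candidates)

print(blocked)
print(parser.crawl_delay("ExampleBot"))  # requires Python 3.6 or later
```

Note that this parser matches rules as plain path prefixes, so "Disallow: /admin/" does not cover "/admin" without the trailing slash.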

htaccess

There is reportedly a problem with the stock Drupal 4.7 .htaccess file. It attempts to redirect requests for yoursite.com/somenode to www.yoursite.com/somenode, but it actually redirects them to the front page instead.

This replacement rewrite condition and rule, supplied by alliax, fixes that oversight. It uses example.com as the target host; substitute the hostname you want visitors to end up on.

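# Redirect any request whose Host header does not start with example.com
# to the same path on example.com, using a permanent (301) redirect.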
RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule (.*) http://example.com/$1 [R=301,L]
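To make the logic of those two directives easier to follow, here is a rough Python model of what they do. It is only a sketch of the decision, not how Apache implements it: the function name redirect_target is made up for this example, and details such as Apache re-appending the query string are not modelled.

```python
import re

CANONICAL_HOST = "example.com"

# Same regular expression as in the RewriteCond line.
HOST_PATTERN = re.compile(r"^example\.com")


def redirect_target(host, path):
    """Return the redirect URL for a request, or None if no redirect applies.

    `path` is the request path without its leading slash, which is how
    mod_rewrite presents it to rules placed in an .htaccess file.
    """
    # RewriteCond %{HTTP_HOST} !^example\.com
    # The rule only fires when the host does NOT match the pattern.
    if HOST_PATTERN.search(host):
        return None

    # RewriteRule (.*) http://example.com/$1 [R=301,L]
    # (.*) captures the whole path; $1 puts it back after the new host.
    captured = re.search(r"(.*)", path).group(1)
    return "http://" + CANONICAL_HOST + "/" + captured


print(redirect_target("www.example.com", "somenode"))
print(redirect_target("example.com", "somenode"))
```

Because the captured path is carried over into the new URL, a request for a specific node lands on that same node rather than on the front page.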

Node Reference: 64780

Submitted by Tips (not verified) on Fri, 2006-09-29 19:24.

I think there is a typo. It should be:

RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule (.*) http://example.com/$1 [R=301,L]

("If not RewriteCond, then RewriteRule")

Submitted by admin on Sat, 2006-09-30 18:29.

Whoops. Good catch, thanks.

Corrected above.

Submitted by Roman Novak (not verified) on Mon, 2006-12-11 00:15.

Hello, I think it is easier to upload the robots.txt file to the server than to try to set it up through Drupal. Also, why put, for example, "Disallow: /includes/" into a robots.txt? If no links to these folders (/includes/...) exist, I would not worry about search robots. Roman

Submitted by OP Tech Works (not verified) on Sat, 2007-02-24 16:25.

I like the module because I have multiple sites running off the same installation; with the module, the robots.txt is backed up in the databases too.
