Robots.txt, meta tags; Blogger’s Ninja Tool to control how search engines index your site

How does Google know what parts of a website the site owner wants to have show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results – through robots.txt and meta tags.

I’ve been doing a bit of my own research on how to weed out unusable content and show just the good content to search engines. I began using robots.txt some time back, but only to disallow a few folders (like my WordPress installation folder, “wp”). The other day, I was reading about the Robots Exclusion Protocol (REP) on Google’s own blog, and learnt a lot that was missing from my understanding of how you can control the way search engines index your site’s content.

Meta Tags

Google has recently introduced a new META tag that lets us set when a page should be removed from the main Google Web Search results. For instance, if you want a particular page removed after the end of this year, add the following META tag to the page (the date format is RFC 850):

<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 31-Dec-2007 23:59:59 GMT">
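If your pages are generated by a script, the tag can be built from a date rather than typed by hand. Here is a minimal Python sketch; the function name `unavailable_after_meta` is my own invention, and it uses 23:59:59 for the last second of the year because Python's `datetime` does not accept an hour of 24.

```python
from datetime import datetime, timezone

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def unavailable_after_meta(expiry: datetime) -> str:
    """Build a GOOGLEBOT meta tag for a timezone-aware expiry datetime."""
    # Convert to UTC so the stamp can be labelled GMT.
    utc = expiry.astimezone(timezone.utc)
    # Month names are spelled out here to avoid locale-dependent output.
    stamp = f"{utc.day:02d}-{MONTHS[utc.month - 1]}-{utc.year} {utc:%H:%M:%S} GMT"
    return f'<META NAME="GOOGLEBOT" CONTENT="unavailable_after: {stamp}">'

tag = unavailable_after_meta(datetime(2007, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
print(tag)
```

Paste the printed tag into the head section of the page you want to expire.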

However, REP META tags work only for HTML pages. Google therefore gives us a way to control access to other documents too: Adobe PDF files, video and audio files, and many other types. This extends the same per-URL flexibility to all other file types. You simply add any supported META tag value to an X-Robots-Tag directive in the HTTP header used to serve the file.

Here are some examples:

→ Don’t display a cache link or snippet for this item in the Google search results
X-Robots-Tag: noarchive, nosnippet

→ Don’t include this document in the Google search results
X-Robots-Tag: noindex

→ Tell Google that a document will be unavailable after the end of 31st Dec 2007 (GMT)
X-Robots-Tag: unavailable_after: 31 Dec 2007 23:59:59 GMT

You can combine multiple directives in the same document.

→ Do not show a cached link for this document, and remove it from the index after the end of 31st Dec 2007 (GMT)
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 31 Dec 2007 23:59:59 GMT
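How you attach these headers depends on your web server. As one hedged illustration, here is a minimal Python WSGI sketch that adds X-Robots-Tag headers based on the file extension. The policy in `robots_headers` (which extensions get which directives) and the placeholder body are assumptions made up for this example, not anything Google prescribes.

```python
def robots_headers(path: str) -> list[tuple[str, str]]:
    """Return X-Robots-Tag headers for a URL path (hypothetical policy)."""
    lowered = path.lower()
    headers = []
    if lowered.endswith(".pdf"):
        # No cached copy, and drop the file from results after 2007.
        headers.append(("X-Robots-Tag", "noarchive"))
        headers.append(("X-Robots-Tag", "unavailable_after: 31 Dec 2007 23:59:59 GMT"))
    elif lowered.endswith((".mp3", ".avi")):
        # Keep audio and video files out of the index entirely.
        headers.append(("X-Robots-Tag", "noindex"))
    return headers

def app(environ, start_response):
    """Minimal WSGI app: a placeholder body plus the policy headers."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "application/octet-stream")] + robots_headers(path)
    start_response("200 OK", headers)
    return [b"placeholder body\n"]
```

To try it locally, you could serve `app` with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`, and inspect the response headers for a path ending in .pdf.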

Robots.txt

Robots.txt lets you control how search engines access your web site, at multiple levels: the entire site, individual directories, pages of a specific type, and down to individual pages.

Googlebot-specific robots.txt

Googlebot, unlike crawlers that stick to the original robots.txt standard, allows the wildcard * to match any sequence of characters. This way, we can write more complex Allow and Disallow directives for Googlebot.

To block WordPress’s PHP files from being crawled by Googlebot, we can use:

User-Agent: Googlebot
Disallow: /wp-*.php

It can even take the form of a folder pattern. Here, folders named like myimages_xyz are blocked (where xyz stands for your numbered folders or something similar):

User-Agent: Googlebot
Disallow: /myimages_*/
Disallow: /porn/*.jpg

(See, I can block Google from looking at my porn images ;-) )
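To get a feel for what these wildcard patterns catch, here is a small Python sketch that mimics the matching. It is my own approximation rather than Googlebot's actual code: it assumes a pattern must match from the start of the URL path and that * stands for any run of characters. The helper name `matches` and the sample paths are made up for illustration.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """True if a robots.txt-style pattern with '*' wildcards matches the path."""
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start, so the pattern behaves like a prefix rule.
    return re.match(regex, path) is not None

print(matches("/wp-*.php", "/wp-login.php"))                # True
print(matches("/wp-*.php", "/index.php"))                   # False
print(matches("/myimages_*/", "/myimages_2007/photo.jpg"))  # True
```

Running your own URLs through something like this before editing robots.txt is a cheap way to check that a pattern is not broader than you meant.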

Googlebot also supports an Allow directive that permits your files and folders to be crawled. This is particularly useful in combination with wildcard pattern matching to create more complex robots.txt files. Here, let’s block a sub-folder on a site but allow some specific folders or files within it. Let’s assume that we have installed WordPress inside a folder called “wp” at the root of the website. So, let’s block the wp folder but allow /wp/wp-content/uploads/ to be crawled.

User-Agent: Googlebot
Disallow: /wp/
Allow: /wp/wp-content/uploads/

You can do even more complex Disallow/Allow pattern matching. Let’s say “?” indicates a session ID; we might like to exclude all URLs that contain one, so that Googlebot does not crawl duplicate pages. However, a URL that ends with a “?” may be the version of the page that we want included. In this scenario, we can block any URL that includes a “?” but allow those that end in “?” (the $ marks the end of the URL):

User-Agent: Googlebot
Allow: /*?$
Disallow: /*?
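When Allow and Disallow rules overlap, it helps to see which one wins. The Python sketch below is an approximation under one stated assumption: the longest matching pattern takes precedence, with Allow winning a tie, which is the behaviour Google documents for Googlebot. Other crawlers may simply apply rules in file order. The names `_regex`, `is_allowed`, `session_rules` and `wp_rules` are hypothetical.

```python
import re

def _regex(pattern: str) -> re.Pattern:
    # '*' is a wildcard; a trailing '$' pins the match to the end of the URL.
    end = "$" if pattern.endswith("$") else ""
    core = pattern[:-1] if end else pattern
    return re.compile(".*".join(map(re.escape, core.split("*"))) + end)

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: ("allow" | "disallow", pattern) pairs for one user agent."""
    hits = [(len(p), kind == "allow") for kind, p in rules if _regex(p).match(path)]
    if not hits:
        return True  # nothing matched, so crawling is permitted
    # Longest pattern wins; on a tie, True (allow) sorts above False.
    return max(hits)[1]

session_rules = [("allow", "/*?$"), ("disallow", "/*?")]
wp_rules = [("disallow", "/wp/"), ("allow", "/wp/wp-content/uploads/")]

print(is_allowed(session_rules, "/page?"))                      # True
print(is_allowed(session_rules, "/page?sid=123"))               # False
print(is_allowed(wp_rules, "/wp/wp-content/uploads/logo.png"))  # True
print(is_allowed(wp_rules, "/wp/wp-login.php"))                 # False
```

The same function covers both the session ID rules and the earlier “wp” folder rules, because in each case the more specific Allow pattern is also the longer one.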

Together, robots.txt and META tags give you fine-grained control over your site’s content and the flexibility to express complex access policies.

Brajeshwar posted this article on Sat, Jul 28th, 2007 at 2:07 pm
Categorized under General

  • http://www.rishiraj.info Rishi

    I use robot stuff in my blog’s META only, will it work with your factors?

  • http://www.brajeshwar.com/ Brajeshwar

    It should with Google as they introduced it recently (and not me).

  • http://ingeniousminds.blogspot.com Ingnious Mind

    Hey friend… I was just wondering: my AdSense account for Blogger does not get approved, probably because, as Google says, it cannot crawl the pages, and therefore I don’t get access… can you help?
