My blog engine creates multiple copies of the same entry in different directories. For example, a given entry would have:

  1. its own unique URL
  2. a place in the topic index
  3. a place in the month index
  4. a place in the year index

I would like Google to prefer the first of these, the unique URL, since it makes finding content easier. I've tried using robots.txt to exclude the topic and date indexes, but then Google ignores or cannot find the unique URLs either.

Would a sitemap help here? Perhaps I could use robots.txt to exclude the indexes and generate a sitemap dynamically that points to just the unique URLs? If you've solved this sort of issue, please let me know.
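If the sitemap route is taken, generating the file is fairly mechanical. The following is a minimal Python sketch, not part of the blog engine: `build_sitemap` is a hypothetical helper, and it assumes the engine can supply a list of the unique entry URLs. It emits only the `urlset`, `url` and `loc` elements of the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a str) listing each URL once, in input order.

    `urls` is assumed to contain only the unique entry URLs, not the
    topic or date index pages.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url in urls:
        if url in seen:
            continue  # skip duplicates so each entry is listed once
        seen.add(url)
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

The result could be written to a file such as sitemap.xml at the site root each time the blog is rebuilt; the file name and location here are assumptions.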

  • Thanks to you both - exactly what I was looking for. Miles about 4 years ago
  • Actually, there is a problem. The Google help page says to add this <link> tag inside the <head> section of the duplicate-content URLs to specify your preferred version: <link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />. With that tag on http://www.example.com/product.php?item=swedish-fish&category=gummy-candy and http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678, Google will understand that the duplicates all refer to the canonical URL http://www.example.com/product.php?item=swedish-fish. The problem is that a single index page contains a number of duplicated entries, and I can specify only one of them in link rel="canonical". Any other workarounds? Miles about 4 years ago

3 answers

Miles 0
0 points
This was chosen as the best answer

I ended up adding

<meta name="robots" content="noindex, follow">

to all of the index pages using this Perl one-liner:

find . -name index.html -print0 | xargs -0 perl -pi -e 's/<head>/<head>\n<meta name="robots" content="noindex, follow">/g'

That way, the search engines can find the canonical URLs, but will ignore all of the topic and date index pages. Will report back once I've had a chance to see how it works out over the next few weeks.
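The one-liner adds the tag again every time it is run. Below is a rough Python sketch of the same edit that can be repeated safely. The function names `add_noindex` and `tag_index_pages` are hypothetical, and, like the one-liner, the sketch assumes that only the topic and date index pages are named index.html (a front page with that name under the same root would be tagged too).

```python
import re
from pathlib import Path

META = '<meta name="robots" content="noindex, follow">'
# Matches <head> or <head ...attributes>, but not <header>.
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_noindex(html):
    """Insert the robots meta tag right after the opening <head> tag, once."""
    if META in html:
        return html  # already tagged; leave the page unchanged
    return HEAD_RE.sub(lambda m: m.group(0) + "\n" + META, html, count=1)


def tag_index_pages(root):
    """Rewrite every index.html under root; return the paths that changed."""
    changed = []
    for path in Path(root).rglob("index.html"):
        original = path.read_text(encoding="utf-8")
        updated = add_noindex(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `tag_index_pages(".")` from the site's output directory would roughly correspond to the `find` command above; the UTF-8 encoding is an assumption about how the pages are stored.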

UPDATE 1: Just found someone with the same issue who solved it the same way (with "noindex, follow"):

http://www.webmasterworld.com/forum30/28772.htm

Scroll down to lammert's post #:744047.

UPDATE 2: After a few weeks, Google was properly indexing my site, and it continues to do so more than a month later.

Answered about 4 years ago by Miles
  • Be careful: the search engines might be using the topic and date index pages to find all your posts. Are there links to the posts on the pages that you haven't excluded from indexing? Rob Crowther about 4 years ago
  • Thanks for the follow-up, Rob! I could set up an automatic XML sitemap if need be. However, my understanding is that with "noindex, follow", Google will not index the page for search but will still follow the links on it. Is that not correct? Miles about 4 years ago
153351 5
1 point

You're looking for a canonical link. See here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

Answered about 4 years ago by 153351
1 point

A sitemap may help, but what you want to do is figure out how to make your blog engine create canonical URL links for your posts. WordPress, for example, does this automatically unless the theme explicitly disables it.
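As a rough illustration of what the engine would need to emit, here is a small Python sketch. `canonical_link` is a hypothetical helper, not something the blog engine or WordPress provides; it only builds the tag for the <head> of each duplicate copy of an entry, and it assumes the entry's unique URL is already known.

```python
from html import escape


def canonical_link(unique_url):
    """Return a <link rel="canonical"> tag pointing at the entry's unique URL."""
    # Escape "&", quotes and angle brackets so the URL is safe as an attribute.
    return '<link rel="canonical" href="%s" />' % escape(unique_url, quote=True)
```

This covers pages that duplicate a single entry. An index page that lists several entries has no single canonical target, so this approach does not apply to it.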

Answered about 4 years ago by Rob Crowther