My blog engine creates multiple copies of the same entry in different directories. For example, a given entry would have:
- its own unique URL
- a place in the topic index
- a place in the month index
- a place in the year index
I would like Google to prefer the first of these (the entry's unique URL), since that makes the content easier to find. I've tried using robots.txt to exclude the topic and date indexes from crawling, but then Google ignores or cannot find the unique URLs either.
Would a sitemap help here? Perhaps using robots.txt to exclude the indexes, and generating a sitemap dynamically that points to just the unique URLs? If you've solved this sort of issue, please let me know.
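For reference, the robots.txt attempt looked roughly like this (the directory names here are hypothetical; substitute your blog's actual index paths):

```text
User-agent: *
Disallow: /topics/
Disallow: /2009/
```

The catch is that robots.txt blocks crawling entirely, so Google never reads the links on those index pages and may fail to discover the unique URLs they point to.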
I ended up adding
<meta name="robots" content="noindex, follow">
to all of the index pages using this Perl one-liner:
find . -name index.html -print0 | xargs -0 perl -pi -e 's/<head>/<head>\n<meta name="robots" content="noindex, follow">/g'
That way, the search engines can find the canonical URLs, but will ignore all of the topic and date index pages. Will report back once I've had a chance to see how it works out over the next few weeks.
UPDATE 1: Just found someone with the same issue who solved it the same way (with "noindex, follow"):
Scroll down to lammert's post #744047.
UPDATE 2: After a few weeks Google was properly indexing my site, and continues to do so after more than a month.
You're looking for a canonical link. See here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
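In practice that's a single <link> element in the <head> of each duplicate copy, pointing at the entry's unique URL (the example.com address below is a placeholder):

```html
<!-- On each topic/month/year copy of the entry; href is a placeholder URL -->
<link rel="canonical" href="http://example.com/entries/my-entry.html">
```

Unlike the robots.txt approach, this lets Google crawl the index copies while consolidating them onto the unique URL.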