Google indexes its own duplicate content.
Wednesday, 22 November 2006
Google has a massive flaw in its own robots.txt and domain set-up that allows the Google directory and other content to be crawled at a variety of other Google subdomains and domains, creating a vast array of duplicate-content permutations.
For example:
http://groups.google.com/Top/
http://news.google.com/Top/
http://images.google.com/Top/
At first we thought it was only Yahoo! picking this up:
http://siteexplorer.search.yahoo.com/search?ei=UTF-8&p=news.google.co.uk%2FTop
But Google themselves seem to be indexing their own erroneous pages:
http://www.google.co.uk/search?hl=en&q=news.google.co.uk%2Ftop
Oooh, that's scary, and hopefully not intentional:
http://news.google.fr/googlebooks/scarystories/
This is bad because Google sensibly has a very strong policy against duplicate content, yet is unwittingly allowing it to be fully accessible and indexable. According to the robots.txt file's last-modified date, it has been this way since at least 09 November 2006 22:49:53. Whether it was different before then I am not sure, but I suppose not. The further you dig, the more you find that the same robots.txt file, with the same last-modified date, is being used on almost all Google domains and subdomains except the advertising network parts. That suggests a common platform for load balancing and query handling throughout the entire global search platform, which must make it a nightmare to organise which parts of the Google network can be indexed and which cannot, hence this mess with duplicate content.
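The robots.txt comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used for this post: the function names `fetch_robots` and `group_by_robots` are made up for the example, and it assumes each host serves robots.txt over plain HTTP and returns a Last-Modified header.

```python
import hashlib
import urllib.request
from collections import defaultdict


def fetch_robots(host, timeout=10):
    """Return (Last-Modified header, body bytes) for http://<host>/robots.txt.

    Hypothetical helper: assumes the host answers over plain HTTP.
    The header may be None if the server does not send it.
    """
    url = f"http://{host}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.headers.get("Last-Modified"), resp.read()


def group_by_robots(hosts, fetch=fetch_robots):
    """Group hosts that serve an identical robots.txt.

    Hosts land in the same group when both the Last-Modified value and
    the SHA-256 digest of the file body match.
    """
    groups = defaultdict(list)
    for host in hosts:
        last_modified, body = fetch(host)
        digest = hashlib.sha256(body).hexdigest()
        groups[(last_modified, digest)].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Hosts taken from the examples in this post; extend the list as needed.
    hosts = ["groups.google.com", "news.google.com", "images.google.com"]
    for (last_modified, digest), members in group_by_robots(hosts).items():
        print(last_modified, digest[:12], members)
```

If most hosts fall into one large group, that is consistent with a single shared robots.txt; any host in its own group is worth a closer look.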
One of the other sites that Google seems to want fully indexed is finance.google.com, so it doesn't make an appearance in the robots.txt file, which is fine. However, because this is a recent-ish launch, it seems someone thought about the problem and either ran around the other teams to get them to change their code or tweaked the Netscaler rules to put in redirects to finance.google.com when the /finance directory is used straight after the domain. They didn't get all the subdomains, though, and /finance can still be fully indexed at either www. or finance., and maybe some others.
Matt, perhaps you want to get someone to run up the matrix of permutations of subdomains vs directories, working out what should be indexed and from where, then plug the holes with redirects on the Netscalers (it is Citrix Netscalers you use, right?). I know it's what you'd ask of us.
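For what it's worth, the matrix itself is easy to sketch. The Python below is a rough illustration under stated assumptions: `redirect_plan` is a hypothetical helper, and the `canonical` mapping (which host should own each directory) is a guess for the sake of the example, not Google's real configuration. It only produces the list of duplicate URLs and where each one should redirect; writing the actual load-balancer rules is left out.

```python
from itertools import product


def redirect_plan(hosts, paths, canonical):
    """List (duplicate_url, target_url) pairs for every host/path
    combination that is not the canonical home of that path.

    `canonical` maps each path to the single host that should serve it.
    """
    plan = []
    for host, path in product(hosts, paths):
        owner = canonical[path]
        if host != owner:
            plan.append((f"http://{host}{path}", f"http://{owner}{path}"))
    return plan


if __name__ == "__main__":
    hosts = [
        "www.google.com",
        "groups.google.com",
        "news.google.com",
        "images.google.com",
        "finance.google.com",
    ]
    paths = ["/Top/", "/finance"]
    # Assumed ownership, for illustration only.
    canonical = {"/Top/": "www.google.com", "/finance": "finance.google.com"}
    for duplicate, target in redirect_plan(hosts, paths, canonical):
        print(f"{duplicate} -> {target}")
```

Every line of output is one hole to plug, so the length of the plan gives a quick measure of how much duplicate content is exposed.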