The Implications of Google Indexing Forms
Tuesday, 15 April 2008
Google recently made a post on their official blog announcing that they have begun crawling HTML forms. This is interesting not just because it could lead to the discovery of far more pages, which is a good thing, or because it could save developers a lot of hassle when creating search-engine-friendly content, but because it could cause reasonably well-SEO'd websites a bit of a headache.
Google has told us that they will begin indexing pages on larger websites via those sites' form elements, when the form uses a GET request (which displays the entire query in the location bar; POST, the alternative, does not), submitting queries relevant to the content of the page. In Google's words:
"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML."
While this might seem fine, and quite cool at the same time, I can see it causing duplicate content issues on a large scale if the engines begin to index certain forms. Take, for example:
<form action="form.php" method="GET">
Product: <input type="text" name="product" />
<input type="hidden" name="prev" value="page2" />
<input type="submit" name="formSubmitted" value="Submit" />
</form>
This would generate a URL similar to:
http://www.mydomain.com/form.php?product=search+term&prev=page2&formSubmitted=Submit
But of course, as a matter of routine, we would either rewrite the URL to make it more elegant for users and search engines, or, if that's not possible, strip out some of the less important name/value pairs and use just a portion of the URL. While I expect Google to be able to notice that
http://www.mydomain.com/form.php?product=search+term
is the same as the URL above, if I have pages that change regularly and a URL more like:
http://www.mydomain.com/product/search-term
Will Google treat that as the same page? I doubt it somehow.
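For reference, a friendly URL like that one usually comes from a rewrite rule rather than a separate page. A minimal sketch using Apache's mod_rewrite, reusing the form.php and product names from my example above (your own setup will differ, and I'm glossing over the hyphen-versus-space difference in the search term):
# .htaccess - map /product/search-term onto the real form handler
RewriteEngine On
RewriteRule ^product/([^/]+)/?$ /form.php?product=$1 [L]
The catch is that Google's form crawler would still discover the raw form.php?product=... version by submitting the form, so the same content becomes reachable at two different addresses.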
The Implications?
While in the short term I don't see this having any drastic implications, I can foresee:
- Duplicate Content Penalties
- More supplementally indexed pages
- Link juice being spread more thinly
- Issues with your "nofollow sculpting"
Possible Fixes?
While I hate the thought that Google, yet again, is going to end up forcing us to use more semantic markup in our HTML, I can see us having to adopt one of the following:
- Block the form pages in robots.txt (see the sketch after this list), though other pages may depend on these?
- <input type="hidden" name="robots" value="nofollow" />
- <meta name="robots" content="NOFORMFOLLOW" />
- Add our own marker attribute to our forms, and use "Disallow: *specialAttribute*" in robots.txt (also sketched below)
- Not using forms altogether
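Of those, only the robots.txt route actually exists today; the others would need Google to support them. A rough sketch of the first and fourth options, reusing the form.php and specialAttribute names from this post (Googlebot already honours the * wildcard in Disallow lines):
# robots.txt - keep crawlers away from the form handler entirely
User-agent: *
Disallow: /form.php
# or block only URLs carrying our own marker parameter
Disallow: /*specialAttribute
The marker itself would just be an extra hidden field on each form, e.g. <input type="hidden" name="specialAttribute" value="1" />, so that every URL generated from the form submission is recognisable.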
Google has taken a large step forward that we had been expecting for some time, and it is a good progression for them. However, now that it is on our doorstep, it is worth bearing in mind that, from a site-level SEO perspective, it will add a new set of considerations to think about.
I think the key is to track the URLs being indexed and then fix them when you spot them.
Otherwise just 301 the form pages to a rewritten URL as appropriate.
Posted by: Patrick Altoft | Wednesday, 16 April 2008 at 10:32 AM
Nice post...
Posted by: Sreejith | Saturday, 09 May 2009 at 04:43 PM
Drawing forms using JavaScript (or Ajax) will solve this issue, right? Will Googlebot respect the "NOFORMFOLLOW" meta tag? I cannot see any post from Google.
Posted by: Sreejith | Saturday, 09 May 2009 at 04:46 PM