Google data suggests 95% of the web is junk
Monday, 14 January 2008
William here spotted a very interesting post about the amount of data Google processes each day. The input data is clearly mass-produced, which suggests it is mostly coming from crawler feeds and uploaded feeds (Google Base), while the output, we guess, is what they actually use for their search results.
http://www.techcrunch.com/2008/01/09/google-processing-20000-terabytes-a-day-and-growing/
We can deduce (rather crudely, I admit) that Google thinks roughly 95% of internet content is not worth including in their index: spam, erroneous, duplicate or plain not relevant. I don't know the ins and outs of Google's processes, so it's also possible that much of the input is HTML markup and other non-essential content; either way, given the volume of data they take in, a lot gets dumped.
Based on William's estimates:
20,000 TB a day at an average item size of 40 KB ≈ 547 billion web pages/content items processed per day
2,000,000,000* images indexed
+ 30,000,000,000* web pages indexed
= 32,000,000,000* indexed content items
* Estimates based on 2006 figures: http://en.wikipedia.org/wiki/Google_search
That would mean they've ignored/filtered out:
547,000,000,000 - 32,000,000,000
= 515,000,000,000 content items
Amount kept: ~5.85%
Amount filtered or dumped: ~94.15%
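If you want to sanity-check the arithmetic yourself, here's a quick Python sketch using the post's own figures; the 40 KB average item size and the 2006 index counts are William's assumptions, not Google-published numbers.

```python
# Back-of-envelope check of the figures above; every input here is one of
# the post's own estimates, not a Google-published number.

items_per_day = 547e9          # ~20,000 TB/day at an assumed ~40 KB per item
indexed_images = 2e9           # images indexed (2006 estimate)
indexed_pages = 30e9           # web pages indexed (2006 estimate)

indexed_items = indexed_images + indexed_pages     # 32 billion items
filtered_items = items_per_day - indexed_items     # ~515 billion items a day

kept = indexed_items / items_per_day
print(f"filtered out per day: {filtered_items:,.0f}")    # 515,000,000,000
print(f"kept in the index:    {kept:.2%}")               # ~5.85%
print(f"filtered or dumped:   {1 - kept:.2%}")           # ~94.15%
```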
Compared with the MapReduce input and output data volumes:
map input data (TB): 403,152
- reduce output data (TB): 14,018
= 389,134 TB
Amount kept: ~3.48%
Amount filtered or dumped: ~96.52%
Mean: (94.15% + 96.52%) / 2 ≈ 95.3%
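And the same quick check on the MapReduce side, using the daily input/output volumes quoted in the TechCrunch post, followed by the mean of the two "dumped" estimates:

```python
# The same check on the MapReduce side, using the daily data volumes quoted
# in the TechCrunch post, then the mean of the two "dumped" estimates.

map_input_tb = 403_152        # MapReduce input per day (TB)
reduce_output_tb = 14_018     # reduce output per day (TB)

dumped_mapreduce = 1 - reduce_output_tb / map_input_tb   # ~96.52% of the data volume
dumped_items = 1 - 32e9 / 547e9                          # ~94.15% of the item count

mean_dumped = (dumped_mapreduce + dumped_items) / 2
print(f"dumped (data volume): {dumped_mapreduce:.2%}")   # ~96.52%
print(f"dumped (item count):  {dumped_items:.2%}")       # ~94.15%
print(f"mean:                 {mean_dumped:.2%}")        # ~95.3%
```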
How very sad. I knew it was bad, just not this bad.
Posted by: boris | Monday, 25 February 2008 at 03:12 AM