Google data suggests 95% of the web is junk
Monday, 14 January 2008
William here spotted a very interesting post about the amount of data Google processes each day. The input data is clearly mass-produced, which suggests it is mostly coming from crawler feeds and uploaded feeds (Google Base), while the output, we guess, is what they actually use for their search results.
http://www.techcrunch.com/2008/01/09/google-processing-20000-terabytes-a-day-and-growing/
We can deduce (rather crudely, I admit) that Google thinks roughly 95% of internet content is not worth including in their index: spam, erroneous, duplicate or plain not relevant. I don't know the ins and outs of Google's processes, so it's also possible that much of the input is HTML markup and other non-essential content; either way, given the volume of data they take in, a lot gets dumped.
Based on William's estimates:
20,000 TB a day at an average item size of 40 KB ≈ 547 billion web pages/content items processed per day
2,000,000,000* images indexed
+ 30,000,000,000* web pages indexed
= 32,000,000,000* indexed content items
* Estimates based on 2006 figures: http://en.wikipedia.org/wiki/Google_search
That would mean they've ignored/filtered out:
547,000,000,000 - 32,000,000,000
= 515,000,000,000 content items
Amount kept: ~5.85%
Amount filtered or dumped: ~94.15%
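If you want to sanity-check the arithmetic yourself, here's a quick Python sketch using the post's own figures; the 40 KB average item size and the 2006 index counts are William's assumptions, not Google-published numbers.

```python
# Back-of-envelope check of the figures above; every input here is one of
# the post's own estimates, not a Google-published number.

items_per_day = 547e9          # ~20,000 TB/day at an assumed ~40 KB per item
indexed_images = 2e9           # images indexed (2006 estimate)
indexed_pages = 30e9           # web pages indexed (2006 estimate)

indexed_items = indexed_images + indexed_pages     # 32 billion items
filtered_items = items_per_day - indexed_items     # ~515 billion items a day

kept = indexed_items / items_per_day
print(f"filtered out per day: {filtered_items:,.0f}")    # 515,000,000,000
print(f"kept in the index:    {kept:.2%}")               # ~5.85%
print(f"filtered or dumped:   {1 - kept:.2%}")           # ~94.15%
```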
Compared with the MapReduce input and output data volumes:
map input data (TB): 403,152
- reduce output data (TB): 14,018
= 389,134 TB
Amount kept: ~3.48%
Amount filtered or dumped: ~96.52%
Mean: (94.15% + 96.52%) / 2 ≈ 95.3%
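And the same quick check on the MapReduce side, using the daily input/output volumes quoted in the TechCrunch post, followed by the mean of the two "dumped" estimates:

```python
# The same check on the MapReduce side, using the daily data volumes quoted
# in the TechCrunch post, then the mean of the two "dumped" estimates.

map_input_tb = 403_152        # MapReduce input per day (TB)
reduce_output_tb = 14_018     # reduce output per day (TB)

dumped_mapreduce = 1 - reduce_output_tb / map_input_tb   # ~96.52% of the data volume
dumped_items = 1 - 32e9 / 547e9                          # ~94.15% of the item count

mean_dumped = (dumped_mapreduce + dumped_items) / 2
print(f"dumped (data volume): {dumped_mapreduce:.2%}")   # ~96.52%
print(f"dumped (item count):  {dumped_items:.2%}")       # ~94.15%
print(f"mean:                 {mean_dumped:.2%}")        # ~95.3%
```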
How very sad. I knew it was bad, just not this bad.
Posted by: boris | Monday, 25 February 2008 at 03:12 AM