HTML data for the masses: data dump

I have been running regex searches on the HTML of the 8900 or so home pages (out of the top 10000) that I collected over Easter, and am providing the results of the searches I have conducted so far, in raw form:

Top 10000 web sites home pages HTML code data dump

The searches of the 8900 sample pages' HTML targeted various HTML elements and attributes.
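As a rough illustration of this kind of search (not the actual scripts used to produce the data dump), here is a minimal Python sketch. It assumes the collected pages are held in a dictionary keyed by URL; the function names and the two sample pages are hypothetical. Regex matching on HTML is approximate: for example, an attribute value containing ">" would defeat these patterns.

```python
import re

def element_pattern(name):
    # Match an opening tag such as <nav> or <nav class="main">, case-insensitively.
    # The lookahead stops "nav" from matching longer names like <navigation>.
    return re.compile(r"<" + re.escape(name) + r"(?=[\s>/])[^>]*>", re.IGNORECASE)

def attribute_pattern(name):
    # Match any tag that carries the given attribute name, e.g. alt or tabindex.
    # Assumption: attribute values contain no "<" or ">" characters.
    return re.compile(
        r"<[^<>]*\s" + re.escape(name) + r"(?=[\s=>/])[^<>]*>", re.IGNORECASE
    )

def find_matches(pages, pattern):
    """Return {url: [matched tag, ...]} for every page with at least one match."""
    results = {}
    for url, html in pages.items():
        hits = [m.group(0) for m in pattern.finditer(html)]
        if hits:
            results[url] = hits
    return results

# Hypothetical sample data standing in for the collected home pages.
pages = {
    "http://example.com/": '<header><nav class="main"><a href="/">Home</a></nav></header>',
    "http://example.org/": '<div id="navigation"><img src="a.png" ALT="logo"></div>',
}

nav_hits = find_matches(pages, element_pattern("nav"))
alt_hits = find_matches(pages, attribute_pattern("alt"))
```

Here `nav_hits` lists the pages that use the nav element and `alt_hits` lists the tags carrying an alt attribute; writing such results out per element or attribute would give files comparable in spirit to those listed below.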

NOTE: the resulting data output files are sometimes large and their HTML code is woeful; they are supplied as is. As time permits, I will analyse the data and also clean up the HTML code.

data dump
element/attribute HTML file   size       last modified date
address.html                  338 KB     11/04/2012
alt.html                      23573 KB   12/04/2012
aria.html                     2566 KB    11/04/2012
audio.html                    5 KB       10/04/2012
doctypeall-clean.zip          5 KB       11/04/2012
figure-figcaption.html        3034 KB    11/04/2012
footer.html                   1853 KB    10/04/2012
generator.html                1548 KB    10/04/2012
header.html                   2659 KB    11/04/2012
hgroup.html                   247 KB     10/04/2012
label-placeholder.htm         258 KB     12/04/2012
longdesc.html                 2194 KB    10/04/2012
nav.html                      2194 KB    11/04/2012
placeholder-title.html        467 KB     12/04/2012
placeholder.html              1489 KB    12/04/2012
section.html                  4202 KB    10/04/2012
summaryattribute.html         1068 KB    12/04/2012
tabindex.html                 6848 KB    12/04/2012
th.html                       5557 KB    12/04/2012
u.html                        2363 KB    10/04/2012
video.html                    143 KB     10/04/2012
top10000URL1.txt              330 KB     11/04/2012
top10000URL2.txt              79 KB      09/04/2012

About Steve Faulkner

Steve is the Technical Director at TPGi. He joined TPGi in 2006 and was previously a Senior Web Accessibility Consultant at Vision Australia. He is the creator and lead developer of the Web Accessibility Toolbar accessibility testing tool. Steve is a member of several groups, including the W3C Web Platforms Working Group and the W3C ARIA Working Group.