HTML data for the masses: data dump

I have been running regex searches on the HTML of roughly 8900 of the top 10000 home pages I collected over Easter, and am providing the results of the searches conducted so far, in raw form:

Top 10000 web sites home pages HTML code data dump

Searches on the HTML of the 8900 sample pages were conducted on various HTML elements and attributes.
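As a rough illustration of this kind of search, here is a minimal Python sketch. It assumes the sample pages are saved as .html files in one directory. The patterns, the function names and the directory layout are all hypothetical; they are not the actual searches used to produce the files listed here. Regex matching on HTML is only approximate (it will also match inside comments and scripts, for example), so treat the counts as indicative.

```python
import re
from pathlib import Path

# Hypothetical patterns for a few of the features of interest.
# These are illustrative guesses, not the searches used for the data dump.
PATTERNS = {
    "nav": re.compile(r"<nav\b[^>]*>", re.IGNORECASE),
    "longdesc": re.compile(r"<img\b[^>]*\blongdesc\s*=", re.IGNORECASE),
    "placeholder": re.compile(
        r"<(?:input|textarea)\b[^>]*\bplaceholder\s*=", re.IGNORECASE
    ),
}


def search_html(html: str) -> dict[str, int]:
    """Count the matches of each pattern in one page's HTML."""
    return {name: len(rx.findall(html)) for name, rx in PATTERNS.items()}


def search_directory(root: str) -> dict[str, int]:
    """Count how many saved pages under root use each feature at least once."""
    pages_with = {name: 0 for name in PATTERNS}
    for path in Path(root).glob("*.html"):
        # Home pages come in many encodings; replace undecodable bytes
        # rather than failing on them.
        html = path.read_text(encoding="utf-8", errors="replace")
        for name, count in search_html(html).items():
            if count:
                pages_with[name] += 1
    return pages_with
```

Calling search_directory("pages") on a folder of saved home pages (the folder name is an assumption) would return a per-feature count of pages, which is roughly the kind of figure an analysis of the raw files could start from.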

NOTE: the resulting data output files are sometimes large and the HTML code is woeful; they are supplied as is. As time permits, I will analyse the data and clean up the HTML code.

data dump
element/attribute HTML file   size        last modified date
address.html                  338 KB      11/04/2012
alt.html                      23573 KB    12/04/2012
aria.html                     2566 KB     11/04/2012
audio.html                    5 KB        10/04/2012
doctypeall-clean.zip          5 KB        11/04/2012
figure-figcaption.html        3034 KB     11/04/2012
footer.html                   1853 KB     10/04/2012
generator.html                1548 KB     10/04/2012
header.html                   2659 KB     11/04/2012
hgroup.html                   247 KB      10/04/2012
label-placeholder.htm         258 KB      12/04/2012
longdesc.html                 2194 KB     10/04/2012
nav.html                      2194 KB     11/04/2012
placeholder-title.html        467 KB      12/04/2012
placeholder.html              1489 KB     12/04/2012
section.html                  4202 KB     10/04/2012
summaryattribute.html         1068 KB     12/04/2012
tabindex.html                 6848 KB     12/04/2012
th.html                       5557 KB     12/04/2012
u.html                        2363 KB     10/04/2012
video.html                    143 KB      10/04/2012
top10000URL1.txt              330 KB      11/04/2012
top10000URL2.txt              79 KB       09/04/2012

About Steve Faulkner

Steve is the Technical Director at TPGi. He joined TPGi in 2006 and was previously a Senior Web Accessibility Consultant at Vision Australia. He is the creator and lead developer of the Web Accessibility Toolbar accessibility testing tool. Steve is a member of several groups, including the W3C Web Platforms Working Group and the W3C ARIA Working Group.