HTML5 Accessibility Chops: data for the masses

One of the stumbling blocks in working out how new (and old) HTML5 features affect accessibility is the lack of publicly accessible usage data. Without data it is difficult to argue for the inclusion of features in HTML5 or to work out how features should be supported for accessibility. I have made an initial attempt to rectify this by collecting the HTML content of the home pages of the top 10,000 web sites.

I spent most of the Easter long weekend collecting the HTML pages. The original source for the “top 10,000” site URLs was this URL list I found on Pastebin. I used HTTrack Website Copier to capture the HTML files. The initial pass was somewhat affected by redirects, so I went through the error log and collected a second list of URLs from the captured pages that had resulted in “page has moved” files. The resulting 8915 HTML pages come from these 2 sets of URLs. The HTML content (including URL lists) is provided as a zip file:

Top 10000 HTML files zip file – 121 MB (Please only download if you are going to make use of the data)

hgroup element usage

I have only just started to analyse the data. The first analysis is of the new HTML5 hgroup element and this is as yet only a simple gathering of instances of its use. No attempt has been made as yet, for example, to analyse what percentage of its use conforms to HTML5 author conformance requirements.

Of the top 8915 HTML pages, 79 (0.89%) were found to include use of the HTML5 hgroup element. A total of 418 instances of the hgroup were found within the 79 pages.
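
The kind of simple instance count described above could be sketched roughly as follows. This is only an illustration using Python's standard `html.parser` module, not the method actually used for the figures in this post; the names `HgroupCounter`, `count_hgroups` and `summarise` are hypothetical, and the pages are assumed to be already loaded as text.

```python
from html.parser import HTMLParser


class HgroupCounter(HTMLParser):
    """Counts <hgroup> start tags in a single HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <HGROUP> is counted too.
        if tag == "hgroup":
            self.count += 1


def count_hgroups(html_text):
    """Return the number of hgroup elements opened in html_text."""
    parser = HgroupCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def summarise(pages):
    """pages maps a page name to its HTML text (an assumed input shape).

    Returns (pages using hgroup, total hgroup instances, percentage of pages).
    """
    counts = {name: count_hgroups(text) for name, text in pages.items()}
    pages_with = sum(1 for c in counts.values() if c > 0)
    total = sum(counts.values())
    share = 100 * pages_with / len(counts) if counts else 0.0
    return pages_with, total, share
```

Using a real parser rather than a plain text search avoids counting the string "hgroup" where it appears in comments or visible text, although it still says nothing about whether each use conforms to the HTML5 authoring requirements.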

Instances of hgroup element use in top 10000 web sites – home pages

Inclusion of hgroup in HTML5

Note: I am a proponent of the removal and/or replacement of hgroup in HTML5. There are currently 5 change proposals on this subject being reviewed by the W3C HTML working group chairs:

  1. Change Proposal: replace hgroup with the subline element
  2. Change Proposal: no-change hgroup
  3. Change Proposal: replace hgroup with a simple element
  4. Change Proposal: remove hgroup add an outlineMask attribute
  5. Change Proposal: Replace <hgroup> with an element that has a simple content model and backwards compatibility.

About Steve Faulkner

Steve was the Chief Accessibility Officer at TPGi before he left in October 2023. He joined TPGi in 2006 and was previously a Senior Web Accessibility Consultant at vision australia. Steve is a member of several groups, including the W3C Web Platforms Working Group and the W3C ARIA Working Group. He is an editor of several specifications at the W3C including ARIA in HTML and HTML Accessibility API Mappings 1.0. He also develops and maintains HTML5accessibility and the JAWS bug tracker/standards support.


John Jensen says:

Hi Steve,

Over at Mozilla we’ve been doing something similar, but with regard to CSS properties. You might find some of the data sets in this ticket to be of use: . If not, ping me directly and I can probably get you more.

AlastairC says:

Hi Steve,

It would be interesting to know how many of the sites actually use (or try to use) HTML5. For example, how many use a non-HTML4/XHTML doctype.

You might find that the 0.89% might be 10% of sites using HTML5, or 1%, or 50%…

It would help put the figures in perspective.

Steve Faulkner says:

Hi Alastair, I am crunching the data at the moment, and will provide more details soon. I have looked at how many use the HTML5 doctype and found that approx 17% of the sample pages use it.

Steve Faulkner says:

Hi John, thanks for the heads up. The CSS data will be useful; for instance, I want to look at the use of outline:none.

AlastairC says:

Ah, great, so about 5% of sites trying to use HTML5 also used hgroup. Not many.

Steve Faulkner says:

Hi Alastair, of the 1454 pages using the HTML5 doctype, 77 also used hgroup, so yes, around 5.3%.
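
As a rough sketch of how a doctype check and the percentages in this thread could be computed: the code below assumes a page counts as "using the HTML5 doctype" when it starts with `<!DOCTYPE html>` (case-insensitive, allowing leading whitespace or a byte order mark). That rule, and the names `uses_html5_doctype` and `percentage`, are assumptions for illustration, not the check actually used for the figures quoted here.

```python
import re

# Matches the short HTML5 doctype at the very start of a document.
HTML5_DOCTYPE = re.compile(r"^\s*<!doctype\s+html\s*>", re.IGNORECASE)


def uses_html5_doctype(html_text):
    """True if the document begins with <!DOCTYPE html> (assumed rule)."""
    # Drop a leading byte order mark, which \s does not match.
    return HTML5_DOCTYPE.match(html_text.lstrip("\ufeff")) is not None


def percentage(part, whole):
    """Share of part in whole, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Figures quoted in the post and comments:
hgroup_share_of_html5 = percentage(77, 1454)   # hgroup pages among HTML5-doctype pages
html5_share_of_sample = percentage(1454, 8915)  # HTML5-doctype pages among all pages
```

Older doctypes carry a public identifier after `html`, so they do not match the pattern; pages with comments or other content before the doctype would also be missed by this simple rule.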

karl says:

Sounds like cool and interesting work you have started here. I have a little “hmmm”, because we all do the same thing when we try to do surveys on the Web: we often look only at the home pages of web sites, which I guess might create a bias. I wonder if we should add, for each of these sites, at least one secondary page. The issue then is which one. 🙂

Sylvia Egger says:

It would be interesting to know whether the pages in question are using WordPress. WordPress has used hgroup for quite a while now. I will have to check your URL list to verify it. Thanks for the effort.

Steve Faulkner says:

Hi Sylvia, have a look at generator.html; it contains all pages with a meta name=generator. The pages that are identified as using WordPress are contained within the results.
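
For anyone wanting to repeat this kind of check on the downloaded files, a minimal sketch of pulling out `meta name=generator` values might look like the following. It uses Python's standard `html.parser`; the names `GeneratorFinder`, `find_generators` and `looks_like_wordpress` are hypothetical, and the substring test for "wordpress" is an assumed heuristic, not how generator.html was produced.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            self.generators.append(attr.get("content") or "")


def find_generators(html_text):
    """Return the list of generator strings declared in html_text."""
    finder = GeneratorFinder()
    finder.feed(html_text)
    finder.close()
    return finder.generators


def looks_like_wordpress(html_text):
    """Assumed heuristic: any generator value mentioning WordPress."""
    return any("wordpress" in g.lower() for g in find_generators(html_text))
```

A site can remove or change its generator tag, so a check like this only finds pages that declare themselves, and will undercount actual WordPress use.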