Captions and Transcripts and Audio Descriptions, Oh My!

When people think about making video accessible, captioning is the accepted, and sometimes the only, solution they consider. For audio, the go-to solution is a transcript. Yet in certain situations these approaches only partially address users’ needs. This article aims to raise awareness of the different methods for providing alternatives to audio and video media, and of when each should be considered. These alternatives include captions, transcripts, and audio description.


To ensure time-based media content is accessible to as many people as possible:

  • For Audio-only content, such as recorded podcasts, interviews, etc., provide a transcript.
  • For Video-only content, such as animations or webcams, provide a transcript or an audio description, which describes the contents of the video.
  • For Video and Audio content, provide captions and a transcript. If information is communicated visually yet not through dialogue, provide audio description.
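The three rules above can be read as a small lookup. Here is a minimal Python sketch of that reading; the names `MediaKind` and `recommended_alternatives` are hypothetical, chosen only for illustration, and the function encodes this article's recommendations rather than the full text of WCAG.

```python
from enum import Enum


class MediaKind(Enum):
    AUDIO_ONLY = "audio-only"
    VIDEO_ONLY = "video-only"
    AUDIO_AND_VIDEO = "audio and video"


def recommended_alternatives(kind: MediaKind, has_visual_only_info: bool = False) -> list[str]:
    """Return the alternatives recommended in the list above for a kind of media.

    `has_visual_only_info` is an assumed flag meaning "information is
    communicated visually yet not through dialogue".
    """
    if kind is MediaKind.AUDIO_ONLY:
        return ["transcript"]
    if kind is MediaKind.VIDEO_ONLY:
        # Either option is acceptable for video-only content.
        return ["transcript or audio description"]
    alternatives = ["captions", "transcript"]
    if has_visual_only_info:
        alternatives.append("audio description")
    return alternatives


print(recommended_alternatives(MediaKind.AUDIO_AND_VIDEO, has_visual_only_info=True))
```

A helper like this could sit in a publishing checklist, so that each new media file is flagged when one of its alternatives is missing.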


Captions are the textual representation of the audio in video content. They include:

  • spoken dialogue
  • speaker identification (unlike subtitles)
  • information about non-dialogue sounds, such as laughter, applause, etc.

Captions directly benefit the 466 million individuals worldwide living with disabling hearing loss, according to the World Health Organization. This number may be greater, as many individuals do not identify as being deaf or hard-of-hearing. Without captions, almost all video content is inaccessible to these users.

Captions indirectly benefit everyone. In noisy environments, such as restaurants, bars, or airports, anyone can be temporarily hard of hearing; captions allow individuals with otherwise acceptable hearing to follow the content visually, without having to hear the audio. And think about the last time you forgot your headphones or earbuds for the train ride to work and wanted to watch a video.

Prerecorded Video

You must provide captions for audio content in synchronized video content (see 1.2.2 Captions (Prerecorded) Level A).

Here’s an example of a video with captions turned on followed by the captioned text with timings in SubRip Subtitle (.srt) file format.

YouTube video player with captions showing.
00:00:04,910 --> 00:00:11,550
the Chicago digital accessibility and

00:00:08,220 --> 00:00:15,330
inclusive design Meetup presents no bad

00:00:11,550 --> 00:00:18,570
Legos automated accessibility testing in

00:00:15,330 --> 00:00:21,619
component driven web development with

Captions should be in the same language as the video and synced with the audio content. They can be either open (always visible, embedded in the video media) or closed (can be turned on or off).
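To check that captions are synced, it helps to read the timings programmatically. The following is a rough Python sketch, not a complete SubRip parser: it assumes timing lines of the form shown in the example above, skips any other line that precedes a timing line (such as the numeric counters a full .srt file carries), and treats the non-blank lines after a timing line as that cue’s text. The names `Cue`, `to_ms`, and `parse_cues` are made up for this illustration.

```python
import re
from typing import NamedTuple

# Matches an SRT timing line such as "00:00:04,910 --> 00:00:11,550".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


class Cue(NamedTuple):
    start_ms: int
    end_ms: int
    text: str


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four fields of an SRT timestamp to whole milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_cues(srt_text: str) -> list[Cue]:
    """Collect (start, end, text) cues; lines before a timing line are ignored."""
    lines = srt_text.splitlines()
    cues = []
    i = 0
    while i < len(lines):
        match = TIMING.fullmatch(lines[i].strip())
        i += 1
        if match is None:
            continue  # blank line, numeric counter, or anything else
        fields = match.groups()
        text_lines = []
        # The cue text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            text_lines.append(lines[i].strip())
            i += 1
        cues.append(Cue(to_ms(*fields[:4]), to_ms(*fields[4:]), " ".join(text_lines)))
    return cues


SAMPLE = """\
00:00:04,910 --> 00:00:11,550
the Chicago digital accessibility and

00:00:08,220 --> 00:00:15,330
inclusive design Meetup presents no bad
"""

for cue in parse_cues(SAMPLE):
    print(cue.start_ms, cue.end_ms, cue.text)
```

Note that in the example above the second cue starts at 8.22 seconds, before the first one ends at 11.55 seconds, so a sync check built on this sketch should tolerate overlapping cues.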

Captions for Live Video

Captions displayed on separate monitor.
Captions being presented during a conference talk on a separate monitor.

You must provide captions for audio content in synchronized live streaming video (see 1.2.4 Captions (Live) Level AA) at events such as conferences, meetups, and talks. Options for captioning live-streamed events include onsite and remote Computer Aided Real-Time Transcription (CART) services and automated captions. The captions can be embedded in the video stream or accessed via a separate website. When presented alongside the content, they also benefit people attending in person.

Automatic Captions

Automatic captions, or auto-captions, are created not by humans but by speech recognition technology. Their quality varies with factors such as sound quality, environmental noise, bandwidth, and the speaker’s speech patterns. Because the goal of captions should be 100% accuracy, automatic captions should never be used as a permanent alternative to video. They can, however, be used temporarily while a permanent solution is being produced.
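One way to put a number on how far automatic captions fall short of that goal is to compare them against a human-corrected version of the same passage. The sketch below uses word error rate, a metric this article does not itself name and which is brought in here purely as an illustration: the number of word insertions, deletions, and substitutions needed to turn the automatic text into the reference, divided by the number of reference words. The function name and the sample strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length.

    `reference` is the human-corrected text; `hypothesis` is the automatic caption.
    Comparison is case-insensitive and splits on whitespace only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row of the table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the captions
            insertion = current[j - 1] + 1  # extra word in the captions
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("captions benefit everyone", "captions benefit every one"))
```

A rate of 0.0 means the automatic captions match the reference word for word. Anything above that points to text a human still needs to correct, and the metric says nothing about missing speaker identification or non-dialogue sounds.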

Captions Summary

Captions provide baseline accessibility for video content, especially for those with hearing difficulties. However, they do not offer a solution for all individuals. For instance, for individuals who are deaf and have low-to-no vision, captions offer no benefit. Enter transcripts.


Transcripts provide a textual representation of audio and video content. They should include:

  • spoken dialogue
  • information about significant non-dialogue sounds, such as laughter, applause, etc.
  • visual details, such as “showing a mock-up of your website”

Transcripts are the sole alternative for audio-only content, such as podcasts, radio show recordings, and interviews (see 1.2.1 Audio-only and Video-only (Prerecorded) Level A). Transcripts are also likely to be the only alternative for Deaf-blind individuals, who benefit from tactile solutions such as a refreshable braille display. However, transcripts can be used by everyone: when the default presentation is not to one’s liking, the text can be transformed into another format (say, copied from the original transcript into MS Word and formatted to preference).

When a transcript is presented along with its audio or video source content, the timings should be synced with the source media. Below is an example of the transcript panel alongside the video on YouTube. Notice that the currently spoken text is highlighted.

YouTube video with transcripts panel displayed.
The transcripts panel displayed next to a YouTube video. Note that the timestamps can be toggled on/off.
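Highlighting the currently spoken text comes down to finding the last transcript line whose start time is at or before the playback position. Here is a minimal Python sketch of that lookup, assuming the start times are in seconds and sorted in ascending order. The function name `active_line` is hypothetical, and this is not a description of how YouTube implements its panel.

```python
from bisect import bisect_right
from typing import Optional


def active_line(start_times: list[float], position: float) -> Optional[int]:
    """Return the index of the transcript line being spoken at `position` seconds.

    `start_times` must be sorted in ascending order. Returns None when the
    position is before the first line starts.
    """
    index = bisect_right(start_times, position) - 1
    return index if index >= 0 else None


# Start times (in seconds) taken from the caption example earlier in this article.
starts = [4.91, 8.22, 11.55, 15.33]
print(active_line(starts, 9.0))
print(active_line(starts, 2.0))
```

A binary search keeps the lookup fast even for long transcripts, which matters when it runs on every playback time update.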

Transcripts Summary

Transcripts provide an accessible alternative to audio content. By using transcripts along with captions, we address accessibility for many. However, another large audience is still left out: individuals with low-to-no vision. While they can hear the spoken dialogue, there are times when important information is communicated only visually in the video, and it must be described. For that, we need proper planning and, in some cases, audio description.

Audio Description

Audio descriptions enable those who are blind or visually impaired to receive the same information provided visually in media (see 1.2.5 Audio Description (Prerecorded) Level AA). Audio description conveys the significant visual details that are essential to comprehending the content and cannot be understood from the main soundtrack alone.

As an example, here’s a video with little dialogue and plenty of visual information. Play this video while not watching, and imagine what is taking place.

Audio Description fills in the gaps of information by describing critical sound elements, important visual actions, scene changes, and text on the screen during pauses in the dialogue. Here’s the same video with audio description. Again, play this video while not watching, and now think about the benefit audio description provides.

Audio description is required when information is presented only visually. For a movie with a decent budget, audio description is simply part of production, while quick corporate videos may not account for it. In many cases, audio description is required only because of how the video was designed or produced. Say you’ve created a video for your business that presents text on the screen, yet the audio track never announces this text. Remediation would require a new audio track, an expensive proposition. If the video is planned and designed upfront with accessibility for all in mind, audio cues for visual information become the default approach.


There is even more that we can do to make our audio and video content accessible. Examples include providing synced sign language for prerecorded media; providing extended audio description, where the video is paused to allow for audio description; linking to a transcript that includes both captions and audio description; and providing captions for live audio-only content. However, by implementing the information provided in this article, you can be well on your way to WCAG 2.1 Level AA compliance.

For user insight into the impact these methods have, read David Swallow’s post “Sounding out the web: accessibility for deaf and hard of hearing people [Part 2].”




Cindy says:

Re: “For Video and Audio content, provide captions and a transcript. If information is communicated visually yet not through dialogue, provide audio description.”

Doesn’t captioning and audio description meet Level AA conformance if the information is communicated visually yet not through dialogue? Do we need transcripts as well? It’s my understanding that transcripts may be provided but are not required for AA.

Based on the scenario you describe, “information communicated visually, yet not through dialogue”, captions and audio description together would technically meet WCAG Level AA conformance. However, this leaves gaps not covered by WCAG:

  • Users who are both non-sighted and Deaf can neither see nor hear the content. Transcripts provide their only means of obtaining this information.
  • For users with cognitive disabilities, transcripts provide flexibility in how they digest the content.
  • For all users, transcripts provide flexibility in how they obtain the content, to suit their needs.

Joe Snell says:

Great post!

You may want to change your page’s “og:description”. It is currently displaying css.

Joe — fixed now. Thanks!