How are you creating impactful transcripts for your show?
You’ve probably seen a bad podcast transcript. Maybe the creator was lazy in their transcription, maybe the AI was flummoxed by the proper nouns, punctuation, or audio labels. Whatever the reason, such sub-par writing raises the question: is there even a point to having transcripts if they aren’t very good?
That’s one of the queries Mary poses to podcast producer and consultant Jess Schmidt. It turns out Jess isn’t just an expert in the industry, she’s also a font of historical transcript knowledge, thanks in part to her past career generating closed captions for live television. Jess and Mary get into the importance of timestamps, SEO, and accessibility. They tackle the many issues facing platforms and creators alike as automated transcripts take centre stage and the continued importance of human eyes when it comes to rendering written versions of human speech. If you’re not already a transcript nerd, you will be after this episode!
Learn about the future of transcripts from the industry’s storied past:
- The limitations of and issues with AI-generated transcripts on hosting platforms;
- How closed captioning is similar to and different from transcripts;
- The fascinating history of YouTube’s AI-generated “craptions”;
- Jess’s hopeful daydream for the future of podcast transcription.
Links worth mentioning from the episode:
- Read Jess’s op-ed, “Podcasting’s Money Problem”
- WIRED, “The Problem with YouTube’s Terrible Closed ‘Craptions’”
- The origin of automatic captioning on YouTube
- Episode 69, “Intangible Values of a Podcast”
- Episode 88, “Accessibility in Podcasting for Hard-of-Hearing Listeners with Kellina Powell”
- Episode 106, “Accessibility and Ethics in Podcasting with Meg Wilcox”
Engage with Jess Schmidt:
- Learn more about Jess’s work
- Connect with Jess on LinkedIn
Connect with Mary!
- Get curious on your podcasting journey – book a 30-minute complimentary strategy session
- Send feedback with a voice note through the “Send Voicemail” purple button to the right of this webpage
- Or email your feedback to Mary at VisibleVoicePodcast@gmail.com
- Read up on more secrets with the Visible Voice Insights Newsletter
- Link up and connect on LinkedIn
- Engage on Instagram @OrganizedSoundProductions
Show Credits:
- Podcast audio design, engineering, and editing by Mary Chan of Organized Sound Productions
- Show notes written by Shannon Kirk of Right Words Studio
- Post-production support by Kristalee Forre of Forre You VA
- Podcast cover art by Emily Johnston of Artio Design Co.
<< DRIVING THEME STARTS – GHOSTHOOD FEATURING SARA AZRIEL “LET’S GO” BEGINS >>
MARY: Back in my grade 12 history class, I remember my teacher telling us that we learn about history so that we don’t repeat the mistakes of the past. This is also very true in podcasting, although the industry is quite young in comparison to other traditional media like radio, TV, things like that. It does have some history to pull from, though. It’s not like we were born yesterday.
In this case, I’m talking specifically about captions and transcripts for podcasts. I was thinking about how captions for, like, TV, movies, video, that sort of form of media, they do have their own version of transcripts: captions. So that got me wondering, well, how are they different, and yet how are they actually really alike, so that we can utilize what we know about closed captioning and use it for podcast transcripts?
So last year at PodSummit, when I met Jess Schmidt, I knew I had to bring her on the show. You’ll want to make sure you listen in to this episode for her really amazing mini history lesson in captions, or actually “craptions,” that YouTube originally created when they decided to add automatic captions to their videos. This was a tipping point for YouTube. And so with Jess’s background from the closed captioning world, it collides with her current work as a podcast producer, consultant, and instructor based in Calgary, Alberta, Canada, fellow Canadian. So she’s got a lot of insights that even I was like, whoa, right, that happened. And this can now also happen in podcasting. And with all the changes that are constantly happening in podcasting, I would love to see some of this wishful thinking that Jess and I dream up at the end of the episode. So make sure you listen for that.
Fun fact, Jess and I also both fell in love with podcasting back in 2014. We didn’t know this, but we were both fans listening to Serial, the first podcast that really got us into podcasting. For me, it was when I used to actually have to download each episode, which took what seemed like forever, and save it onto my phone and then play it off my phone’s speaker for long car rides. Because my car was quite old as well, so I couldn’t connect my phone to the car speaker. And so it was just playing at max volume on my phone, listening to Serial. I love how technology has changed, right? Podcasting has really changed since then.
And so there’ll be lots more changes to come because it’s the wild, wild west out here in podcasting. So listen in to this episode for the major differences in transcripts versus closed captioning and now, how these transcripts play a role in the big podcasting player apps like Spotify and Apple Podcasts, and really what this could all mean for you and the future of transcripts and accessibility in podcasting.
This is episode 109 with Jess Schmidt on The Podcaster’s Guide to a Visible Voice.
<< WOMAN SINGS: So so so so let’s go >>
Jess, welcome to the show. I cannot wait to laugh more and talk more with you.
[MUSIC FADES AND ENDS]
JESS: Mary, it is such a pleasure to be here. Thanks for inviting me.
MARY: Okay, so when we first met, like, finally in real life, because, as we said, we stalked each other for the longest time, I suppose.
JESS: Yeah. Online friends.
MARY: Online friends. Last year at PodSummit, we had this great connection, and you were telling me about your background of closed captioning.
JESS: Mhmm.
MARY: So then I was like, wait a minute. How is closed captioning different from transcripts? Is there a, uh, difference? Like, when I think of closed captioning, you know, it’s like, on TV, right? Like, it’s the subtitles that you read on TV and then there’s subtitles on YouTube, and now there’s transcripts on podcasts. For the layperson, they might feel like they’re all the same, but is it the same?
JESS: I mean, in what they are accomplishing? Yeah, they’re the same thing. They’re making accessibility possible, especially for people who are hard of hearing or deaf. And even if you don’t belong to one of those groups, like, if you’re just somewhere busy. Like, the example I give for trying to explain what closed captioning is, especially while I was still doing it, was like, you know, the words at the bottom of the TV when you’re, like, at the gym and the TV screen is up and, like, they don’t have any sound on, or you’re at the dentist and they don’t have any sound on. Like, those are the closed captions. That’s what I used to do.
Like, it’s the same function. Like, you’re just translating the spoken word into written text. I mean, there are some differences. If you’re thinking about, like, kind of the key difference between what we would consider a closed caption versus what we consider a transcript is, like, the temporal aspect of it. Like, the time lock of a transcript doesn’t have to have any timestamps on it necessarily. It can still be a transcript, even if it’s just words. But captions most of the time are locked so that they are scrolling out kind of in time with what’s happening.
MARY: Oh.
JESS: So that’s one of the things that makes. Like, if you’re thinking about, like, is this a caption file versus a transcript? Like, the time element can be a differentiator. I mean, I guess you could technically have, like, a live transcript, but one of the big differences is that closed captions, uh, not all the time, because obviously there’s post production, like TV and movies and stuff.
MARY: Mhmm.
JESS: Like if you’re watching on Netflix and you turn on captions, those are probably post production. Like, somebody was able to get the file, listen to it. It wasn’t a rush to do it. Well, maybe it was a rush, but it’s not live.
MARY: Right.
JESS: But then if you have live content, like, that was actually how I started in captioning was as a live closed captioner, where you get like a pre-feed of, let’s say, the news, because that’s obviously not scripted, it’s happening live. So you get like a tiny bit of a pre-feed before it goes out to broadcast. You listen to it, you generate a closed caption that’s going to go along with it. It’s got a little bit of lag, just because, you know, that’s how time works. But for the most part it’s going out so that you’re seeing a caption that hopefully pretty closely matches what the hearing experience of that live content is going to be.
MARY: Yeah, and I wonder too if, like, that’s where places like Apple podcasts and Spotify on the app. It’s kind of like that karaoke feeling of the transcript. It is timed, so I guess they’ve timed it in their system. So to me, as a layperson, I’m like, it feels very much the same as if I was watching a show on TV with the closed captioning. I can, if I was on the bus and didn’t have my earbuds on, right? Like, I, I could still read the transcript on the player,…
JESS: Yeah.
MARY: …on the podcast player. I guess that it’s similar enough.
JESS: Yeah, and this is actually like a good point because, I mean, we can get into like the real technical. I’ve like, really gotten like all the way up to my elbows into, what Apple and Spotify rolled out for AI captions…
MARY: Yeah.
JESS: …And it’s actually, it’s interesting because, I mean, one of the distinctions I made was to be like, oh, a transcript doesn’t have to have timestamps. Actually, one thing I should say is that a good transcript should have timestamps. And the other thing is that when you think about, like, oh, I’ve generated a transcript for a show, like, let’s say, even for, like, your own podcast that you’re working on, like, oh, I generated this transcript for it. Let’s say you really wanted it to be, like, 100% accurate. You don’t want to use AI, you don’t have any other training, you’ve truly just, like, listened to your episode and, like, hand bombed it out, like, just typed it.
MARY: That’s a good amount of work.
JESS: It’s a good amount of work. And then if you wanted to plug it into Apple, you actually couldn’t because Apple doesn’t accept text files. Apple only accepts VTT or SRT files,…
MARY: Right.
JESS: …which are transcript extensions that have time codes built into them. So you would have had to also add in all of those time codes while you were handwriting out this transcript, and just transcribing it by hand is already a lot of work. Like I, as somebody who worked in the industry, would not also be hand typing out time codes. It’s just not a good use of your time. It’s too much work.
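[For anyone curious about the file formats mentioned here, a minimal WebVTT file looks like the sketch below. The cue timings and lines are invented for illustration, but the overall shape, a WEBVTT header followed by time-coded cues, is what the VTT extension implies. An SRT file is similar, with numbered cues and comma-separated milliseconds.]

```text
WEBVTT

00:00:00.000 --> 00:00:03.200
MARY: Jess, welcome to the show.

00:00:03.200 --> 00:00:06.500
JESS: Mary, it is such a pleasure to be here.
```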
MARY: Is there then a way to still write out into a document and then match it up to time? Like how, how do we then create a proper SRT? Because you can create an SRT just by, you know, converting a text file,…
JESS: Mhm.
MARY: …even though it doesn’t have any timestamps to it.
JESS: Mhm.
MARY: So do you, do you know of a way?
JESS: Yeah, I mean the thing that is the easiest to do is to use an AI tool.
MARY: Well yes.
JESS: But I mean, if you wanted to do it without an AI tool, like, there’s other software that you can use. And this is, like, how it works in the caption industry, like I talked about. Like, this is what it’s like for live: you get the pre-feed, you do the captions, your captions are sort of matched already temporally with the way that you’re generating them. If you’re doing post production work, so it’s not live, you’ve just been given the file, like what I was talking about for Netflix, in that situation, normally what they would do, and like, it’s been a while since I’ve worked in this industry, but when I worked there, the protocol was: you took the file, you made the transcript, you would export the transcript wholly without any timestamps, you would port it into a software where you manually would play the video, listen to the audio, and then you would manually stamp out where each line of text should start….
MARY: Oh wow yeah.
JESS: ….So you could get going pretty fast at it. But it was, like, a manual thing that you had to do. And it was a two-step process: you make the transcript. And sometimes it was different people. Like, sometimes it would be one person makes the transcript, you hand it off to somebody else who is working in the timestamping pool, and then they put the timestamps on it. And it’s actually, like, a pretty fast process. But it was fully a manual process. Like, there’s a whole industry dedicated to this. It took me months of training to get good at both of these halves of it.
And so like, that to me is one of the things that makes it really complicated for creators making their own transcripts is even myself having all that training. As soon as I left that job, I didn’t have those tools anymore. Of like, here’s my voicing software that I was using to generate transcripts. Here’s my timestamping software that I was using to then timestamp the transcript. Like it was, even though I have the experience, I was in kind of the same boat as everybody else of like, how do I ad hoc this? And it’s hard.
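[The two-step workflow described above, write the transcript first, then stamp each line with a start time, can be sketched in a few lines of Python. This is a hypothetical illustration of the SRT format, not the proprietary captioning software mentioned in this conversation, and the cue timings here are made up.]

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def cues_to_srt(cues) -> str:
    """Turn a list of (start_sec, end_sec, text) cues into the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Step 1 is the hand-typed transcript; step 2 is the manually stamped times.
cues = [
    (0.0, 2.4, "MARY: Jess, welcome to the show."),
    (2.4, 5.1, "JESS: Mary, it is such a pleasure to be here."),
]
print(cues_to_srt(cues))
```

[The point of the sketch is only that the time codes are a separate pass over the same text, which is why caption shops could split the two jobs between different people.]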
MARY: Yeah. And I think that that’s why it usually gets left by the wayside. They’re like, I don’t have time for this. I know I quote unquote should be doing this, you know, for that accessibility reach. But, you know, like, we had been talking previously in another conversation that Apple and Spotify aren’t solely doing this just for accessibility. There’s also the discoverability, the SEO tools and all of that. So touch on what you believe the transcripts are actually useful for in podcasting today.
JESS: Yeah, I mean, I don’t want to downplay the fact that it is adding some element of accessibility. I don’t want to overstate it being a really good accessibility tool because, you know, especially as somebody who was in the industry and has kind of like since then also kept an eye on how AI transcripts are going. Like, there are some things that it does really well, and there’s some things that it just really, it doesn’t do well and some things that it probably will never be able to do.
So, like, things like nouns are notorious. Like, it is very difficult to get AI to understand nouns, and it’s because there are just so many proper nouns in the world that there’s just no way for an AI to know all of the proper nouns and to be able to recognize why, like, especially like homonyms, like, why do I have this one instead of this one? And things like punctuation. Like there are rules that you can teach an AI that it can generally get right a lot of the time. But English is not the only language that has really wacky punctuation rules. And again, those are like really difficult nuances to teach human beings nevermind also AI.
MARY: Yes. Also, I mean, and then especially something like this when we’re having a conversation, nobody speaks in punctuation….
JESS: Exactly.
MARY: …I don’t say period, comma, em dash, you know.
JESS: Yes, exactly. Just doesn’t happen. Yes, exactly. And so those things, I think are always going to be a struggle. I mean, there’s other things too. I’m going to get to the SEO question, but there’s other things too to think about of what gets added into a transcript that like, AI is never going to be able to do. Things like sound labels. Sound labels are incredibly important because they’re the only way to reflect any choices you would have made in sound design, in a transcript, in a written version of any of the podcast, or any of the media that you are taking from audio and putting into text.
MARY: I love that sound labels, like, I’ve been calling them audio descriptions because I hadn’t figured out what, like, is there a name for it?
JESS: I mean, the captioning term is sound labels,…
MARY: Sound labels.
JESS: …but you can call them whatever.
MARY: Yeah, I mean, that’s true. But I, like, that’s one of the things that I find if you’re going to be doing a transcript, it needs to have those sound labels that audio description, because the whole podcasting audio world is based on sound. So if somebody needs that accessibility part of being immersed in your show, they need to hear in words what the podcast sounds like, right? If you’ve got like, an exciting piece of music or a sad, somber piece of music, or if there’s a lot of laughter or the sound effects that you’re adding in there, or even like sarcasm. You can’t always read what sarcasm is in text.
JESS: Yes.
MARY: So what are some of those other sound labels that I might have been missing out of that list?
JESS: I think that those are, that’s, like, a really good general list. The things that I think about are, like, delivery tone. So those would be things like laughter, or, like, sarcastically. You could put in a sound label for music, obviously, and having it not just be, like, music plays, like describing, like you said, describing the kind of music. You probably put a lot of thought into the kind of music you want in your podcast.
MARY: Yeah.
JESS: And if you just say music plays, then you’re not going to convey anything other than the fact that there is music, you know?
MARY: Right.
JESS: Like, what kind of music, did you pick, like, some other things to think about are. And this isn’t every single podcast. Like, if you’re doing kind of like a, you know, like a more straight chat cast, like what we’re doing here, maybe you don’t have, like, really wacky sound effects happening, but if you have things like sound effects, those are not going to be put into a transcript. When it’s an AI transcript. Like, if you have,…
MARY: Yeah. They’re never going to get that.
JESS: …They’re never going to get that. If anything, maybe if it sounds a little bit like speaking, maybe they’ll turn it into something that it absolutely is not, because it thinks it’s a human voice, not a sound effect. But that’s just incorrect. So you kind of don’t want it to do that. So, yeah, I mean, just from the sound design perspective, being able to convey any of the sound design elements, including, like, diction, it’s not possible unless you are having a human add in things like sound labels into a transcript.
There’s other limitations too. Where like, we know that the voices that the models have trained on typically represent certain parts of population and things like, accents, even gender, regionality, like, all of those things can have a huge impact. Like, I’m sure people have had it where they have a guest, the guest is great. Maybe they have an accent. Maybe English is not their first language. If you run it through, like, through a transcription engine, sometimes it can get it, and sometimes it’s like, oh, this is illegible.
MARY: Right, yes.
JESS: Like, you would have been able to hear it while they were speaking because, you know, it’s not enough to stop the conveyance of meaning while you’re talking. But it was something that the transcription could not understand. So [QUICK SIGH] those are, to me, like, the worries and the limitations. We weren’t quite there yet. But that to me is like, I don’t want to talk about some of the good things that it is doing without also framing it in: these are, like, the red flags that worry me.
That being said, [LAUGHTER] there are some things that it does that I think is really positive. Like, one of the things is, I think normalizing transcripts and making them a minimum requirement, even if the transcript isn’t great. I think the idea of, oh you should be thinking about this is generally a value add for the industry…
MARY: Hmm, yeah.
JESS: …and making it so that it’s like, okay, where do I even start? Like, this is such a heavy lift. Oh, Apple gave me this transcript is like leagues ahead of where we were prior to that. So that’s good because I think that the framing around it is accessibility and that sort of is the feel good piece. But from a technical perspective, and I know that you’ve talked about this before on the show, is the search engine optimization piece. Because the Internet is at this point majority text based. So when you’re searching things in Google, it’s not scrubbing for audio or video, it’s scrubbing for text. And that’s why, like, SEO, like tags and keywords, exists in podcasting, it exists in, like, web design, it’s in all these different touchpoints on the Internet, because that’s how the Internet indexes things, is through written language.
And so, if you have a podcast that’s just audio, there’s no way of indexing it. The only pieces of it that you can index are things like the show notes and the title, which is good. And a lot of the time people are writing those to try and soak up that like, SEO Google juice.
MARY: Yes.
JESS: But there’s not, there’s so much like if you have a 60 minute episode and you have, you know, like a hundred word blurb that is like, your episode description, there’s obviously a huge gulf between what you’ve written in the description and what the actual content of the episode was. And that’s like, again, the SEO piece of it is tough because one of the things that SEO really thrives on and like, benefits from is proper nouns.
MARY: Yes.
JESS: And that’s something that transcripts can’t do. AI transcripts can’t do that. So it is like these gaps already existed. We’re seeing these gaps start to get filled in a little bit. We’re seeing moves towards normalizing people having to care about this and making it like part of your workflow and part of the mental load of how am I putting together the packaging for the content I’m distributing on the Internet? And yet there are still these fairly substantial gaps that still need to be filled within that larger gap. And Apple’s doing some of the work, but then it also is making it kind of hard for you to fill in the rest of the gap, which sucks.
MARY: Yeah, I was thinking too about, like, how, okay, Apple Podcasts transcripts are fairly new to the industry. Like, they only started automatic transcripts in March of 2024. So that’s only two years ago. Spotify started their own in September of 2023. And I was thinking back to an episode I had with Kellina Powell. She’s the Deaf Queen Boss, and she uses accessibility tools to communicate, and that’s how we recorded the podcast with her. She lip reads. And so when I asked her, okay, then do you listen, or how do you listen to audio only podcasts? And she’s like, well, I don’t. I use YouTube, because there’s the lip reading aspect. So one point to video there, right? Like, you can’t do the lip reading without the video aspect.
But then I also realized she uses YouTube because of the captions as well. So then I was like, wait a minute, when did YouTube start doing automatic captioning? And I found on their blog from December 4, 2009. So, 2009 to now we still haven’t gotten the nouns right. YouTube has done it probably as well to index and to figure out how to search through video on the Internet. And I think Apple and Spotify are just catching up to this aspect. And they’re like, wait a minute, if YouTube’s going into podcasting, we need to do what’s YouTube doing. And so now we’ve, we’re like mashing these two worlds together trying to figure out how to index podcasts.
JESS: Yeah, YouTube’s actually a really interesting case study in this too. Because they were so early to the game…
MARY: Yeah.
JESS: …in terms of like AI generating captions, but they actually ended up causing like a minor protest and like an uproar for YouTube users because they started rolling out this like, AI generated auto caption feature, like again, very early. Like they were like a decade and a half earlier to the game than Apple and Spotify have been. But they were so bad,…
MARY: Oh no.
JESS: …it was such low quality that people like, gave it a new name like, and it was prolific enough that people started calling them “craptions”.
MARY: Oh, that’s good.
JESS: And they coined this term specifically because of how bad the YouTube captions were initially and the uproar that was caused by these captions. Like, YouTube saying, oh, we’re doing accessibility now. People who are hard of hearing or, like, have accessibility needs, we’re helping you out by making these auto generated captions. And people were like, truly, this is actually worse. Like, having an incorrect caption, especially, like, even just think about it in the specific instance of somebody who can lip read: you’re lip reading something and then you’re getting something in text underneath it that’s incorrect,…
MARY: Yeah.
JESS: …but could be close enough, then it actually is going to skew your perception. Like, you felt confident that you knew what they were saying while you were lip reading, but then you’re getting this false input of something that’s incorrect textually. It caused enough of an uproar that YouTube actually, like, shuttered it momentarily.
Like, they kind of took back or like, they stopped promoting it as much, and they were like, okay, we got to get to work. And they did make it better. And it was based on people’s feedback of being like, this is worse than nothing. Like, you’ve given us something that’s worse than nothing. Get rid of it or fix it. And they fixed it.
MARY: Oh, wow, that’s amazing.
JESS: Yeah.
MARY: Thank you for that history lesson.
JESS: Hey, that’s what I’m here for the history of “craptions” on YouTube.
MARY: It kind of reminds me of, like, okay, my daughter is 10, and so I remember she was in daycare at the time. And I feel like maybe it’s just the technology of the time, right? Where the AI voice to text still isn’t great, but was poor back then. Again, 10 years ago or so, I had a voice to text feature for my cell phone plan. And so whenever I got a voicemail, it automatically generated a text, and it would text me what the voice message said. And here I am hanging out with two friends, daughter’s at daycare, and I get a voicemail, and I’m like, oh, I just got the text, and it said, can you please come to daycare to pick up your daughter? She has a seizure.
[JESS GASPS]
MARY: And I was like, oh, my gosh. Oh, my gosh. I have to go. I have to. And then I quickly, like, listened to the voicemail, and they actually said, oh, your daughter has a fever.
JESS: Oh, my god. Yeah. Like, this is exactly what I’m talking about.
MARY: Right? Like, that’s horrible!
JESS: Yes, yes, yes. Like, okay, obviously we don’t want your daughter to have a fever, but that is preferable to a seizure.
MARY: Yes.
JESS: Absolutely. Yeah. And see, this is like, a really good example of, like, concrete example of sometimes it being there,…
MARY: Yeah.
JESS: …but being incorrect is more detrimental than just not having it there in the first place.
MARY: Yeah. My heart’s racing right now as I relive that moment.
JESS: Yes. Well, yeah, like, you saying that made my stomach drop. I was like, oh, my god. Like, that truly is, like, one of the scariest things to deal with as a parent, I’m sure, is, like, oh, my kid is not well. And then to immediately have it be like, that was not even correct. Wow, thanks so much, text to.
MARY: Thank you. Yeah. Voice to text feature.
JESS: Yeah. Brutal.
MARY: But that’s what this is now, right?
JESS: It could be yeah.
MARY: Like, or a version of that over the years. Now we’re using probably that same or evolved technology for podcasting transcripts, because it’s the easy way to go. People will take their show, plug it into an AI transcription feature, and then they’re like, okay, good enough, this is it. When, okay, then how do we, as podcasters, and we’re so busy with so many things, do we use that, or how do we call for something better, like YouTube back in the day?
JESS: Yeah, I think you have to go into it with the YouTube craption attitude where you have to be like, okay, I see what you’re doing here. You know, points for trying, but here are the things that we could do to make it better. And to me, the thing to make it better is not, you know, Apple is never going to like hire a transcript proofer or whatever to proof…
MARY: Yes.
JESS: …the millions of transcripts from the millions of podcasts that are hosted on Apple. But I think that there are some things that they could do to make, like, vetting a transcript easier. Like, for instance, it’s actually sort of difficult. Theoretically it’s possible that you could take an auto transcribed transcript that Apple made for you and pull it out as the VTT. I think you could do this. I remember trying to do this and I was like, this is really hard, actually. Like, it was really difficult to do. Oh, I remember, actually, the problem with it was when you go into your settings on Apple, you can either say, I want to opt in to have all of my transcripts done with AI,…
MARY: Right, mhmm.
JESS: …or you say, I want to manually upload them. Like, to me, having to go into the back end of Apple, toggling that back and forth, trying to find your VTT file that it would have generated for the transcript, pulling that out, editing the transcript, and then going back into the settings of Apple and saying, never mind, I don’t want you to generate transcripts for me, I want you to take the uploaded file. Then you have to take that uploaded file. You have to take it to wherever your podcast is hosted. You’re not putting it into Apple. It’s something that gets added to your RSS feed, where the hosting is happening.
MARY: And not all hosts have that feature.
JESS: And not all hosts have that feature, or some of the hosts have that feature. But like Acast for instance, obviously Acast has transcript hosting, but because Acast has dynamic ad insertion baked into the way that they are putting out episodes. Like if you are routing a feed and Apple can see that it’s originating from Acast, it’s automatically just going to give you a transcript. Like you don’t have an option to opt in of I’d prefer to supply my own transcript.
Even if you have transcripts on your Acast RSS feed, it won’t pull that transcript over because it knows that it’s a dynamic insertion podcast and there would have been a dynamic insertion somewhere in the transcript. And so, it’s not static. It only privileges static transcripts. So if you have any dynamic content, really, you can’t really opt into this system of, I would prefer to give you my transcript.
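[For hosts that do support supplying your own file, the mechanism Jess alludes to is the Podcasting 2.0 transcript tag, an entry the host adds to each episode in the RSS feed pointing at a static caption file. The URL below is a made-up example:]

```xml
<podcast:transcript
  url="https://example.com/episodes/109/transcript.vtt"
  type="text/vtt" />
```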
MARY: Oh, I forgot about that aspect, yeah.
JESS: Yeah. And Acast, I’m not trying to single them out. I just thought that was really interesting that, like, they’re on the list. If you go into Apple’s backend and are like, okay, what are the different services that are available from these different hosting sites? And what can I pull over? Acast is on there and transcripts are crossed out. And I was like, oh, this has to be because of the dynamic insertion. Like, I can’t think of any other reason, because they do transcripts.
MARY: Yeah.
JESS: So it’s like, it’s stuff like that where it’s gesturing towards, oh, we want to do this. And we want it to be both accessibility. And it’s also interesting, I remember looking at Apple when they first kind of rolled out, like, what are the rules around these transcripts? And like, how are they thinking about ingested transcripts versus, like their own AI generated transcripts? And I’m going to admit that I haven’t, like, checked back to see if they changed anything. I’d be sort of surprised if they did. But when I first looked at it, so this was back in probably like April 2024, like a month after they’d announced it, I was putting together like a talk because I really had to be in my body about this whole, [LAUGHTER] like, Apple podcast thing.
MARY: Yeah.
JESS: And two of the things that they said that if they’re ingesting a transcript that you’re supplying off of an RSS feed, they had an expectation that transcripts had to be subject to quality standards and that files that did not meet standards would not be displayed, which I thought was hilarious, because at no point did they ever define their quality standards.
MARY: That’s what I’m wondering.
JESS: They just have an expectation for your quality standards.
MARY: What’s standard? What is standard?
JESS: Well, uh, that would be cool if they defined it.
MARY: Yeah.
JESS: But they did not. They did not define it. They just said if you’re not meeting a standard that has been undefined, then they don’t have to display your transcript. Which is, you know, very good legalese. I’m sure they did it to cover their butt, but, like, to me, that’s a huge question mark. And then the other thing that they said is that if you provide your own transcripts, it’s expected that you make sure that your transcripts are free of spelling and punctuation errors. And those are two of the things that are really, really difficult for AI to do.
So there’s a 100% guarantee that the Apple transcripts that are AI-generated have errors. And I’m not saying this to, like, pooh-pooh on Apple, but, like, truly, this is just one of the limitations of this AI tech: it does not know how to do proper nouns most of the time, because proper nouns are incredibly hard, and it does not know how to do punctuation, because punctuation is incredibly hard. So this is another thing that, to me, it’s good that they’re gesturing towards: we want to make this accessible. We want to be part of, you know, supporting more access. We want to be part of, like, spreading the SEO. But then they’re adding in all these barriers of, if you want to provide your own, then we are going to hold you to a certain standard. And it’s like, but you’re not even holding your AI transcripts to that standard. So what does that say about what your actual goal here is and what you’re going to accomplish with it?
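[Editor’s note: the “supply your own transcript” route discussed above works through a transcript tag in the show’s RSS feed, defined by the Podcasting 2.0 namespace, which Apple reads when it ingests the feed. A minimal sketch, with an invented URL; in practice your hosting provider usually writes this tag for you:]

```xml
<!-- Inside one episode's <item> in the RSS feed. The podcast: prefix
     requires xmlns:podcast="https://podcastindex.org/namespace/1.0"
     to be declared on the <rss> element. -->
<podcast:transcript
  url="https://example.com/episodes/112/transcript.vtt"
  type="text/vtt" />
```

[An SRT file would use type="application/x-subrip" instead.]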
MARY: Yeah, and we’re picking on Apple also just because they are the leader in the industry.
JESS: Yeah.
MARY: People know of Apple podcasts. They use it. It is so well known. And so as a leader in the space, like, yeah, I’m wondering what are you doing to improve that accessibility aspect of it?
JESS: I want to foil this also with Spotify because we’ve kind of left Spotify out of the conversation. And actually Apple’s done this a lot better than Spotify has, where they, you know, as much as I’m like, well, what is your quality standard Apple? They did release stuff around, like, here are, uh, what our expectations are, here’s how it’s going to work. Like, they opened it so that you can see a little bit more of the back end. You can opt in and out. Spotify has done this completely black box. Like, Spotify did not announce that it was going to start doing captions. It just started rolling them out. I don’t even know, like, it must be an SRT or a VTT because they are like, temporally locked into the audio, but they actually don’t. As far as, like, the last time I checked, you can’t opt in or out of Spotify captions. You can’t say Spotify, please take, uh, transcripts that I have stored on my RSS feed.
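[Editor’s note: “temporally locked” here means each chunk of text carries its own start and end time. For anyone who hasn’t seen one, a caption file in the WebVTT format Jess mentions looks roughly like this; the timestamps and lines are invented:]

```
WEBVTT

00:00:01.000 --> 00:00:04.500
<v Mary>Welcome back to the show.

00:00:04.500 --> 00:00:07.200
<v Jess>Happy to be here!
```

[The `<v Speaker>` tag is WebVTT’s built-in way of labelling who is talking; SRT files look similar but use numbered cues and comma-separated milliseconds.]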
Like, Spotify, unlike Apple, was like, this is not a conversation. We’re doing this. And even when they rolled out their captions, it wasn’t across the board. Like, Apple put out an announcement that was like, okay, we’re doing this. We’re going to start with most recent uploads to feeds. We’re going to start with feeds that are most active. If in a couple weeks or a couple months you still don’t have any transcripts available for your show, like, contact support, and we’ll get you added to the list, because we want to make sure that, like, we are working through the back catalog, which is a huge undertaking. Spotify was just like, we’re not going to say anything, but we’re doing this now. Like, they rolled it out really spottily. I don’t know how they decided who was going to get captions and who wasn’t.
MARY: Well, it was their exclusive shows first.
JESS: First it was the exclusive. Yeah, exactly. And then the rest of all of us peasants just kind of got it if we got it.
MARY: Yeah.
JESS: And there wasn’t. There was no, like, oh, if you want them, uh, here’s who you contact. Like, it was really like, if you are blessed enough to have the captions already, good for you. You have been selected. And if not, then, like, no other news. Like, I don’t know. Like, I don’t even know if you could have contacted them and been like, hey, where am I? Where are my captions? Because they have done it fully, fully black box. So obviously they’re opening themselves up to less scrutiny.
So I think that what Apple’s done is better in comparison to Spotify. And there’s, like, so many differences between Apple and Spotify. We don’t have to get into it now. But I think that Apple really, like, does try to establish themselves as a, like, thought leader. And they’re like, we want to be, like, the head of the space. I mean, they’re not anymore. We’ve seen that Spotify is getting more of the podcasting share, in particular in the last, like, year or whatever, which is when they cracked it, I think. But it’s interesting that, like, Spotify was first to bat over Apple, but is not making it a conversation. It’s just something that they’re doing. So I don’t think that’s better. I think that what Apple is doing is better. I think there’s still space.
MARY: Oh yeah, there’s always room for improvement.
JESS: Yeah.
MARY: And that’s why I wonder too about, like, the future beyond just these automated transcripts. What do you feel like that could look like? Rose-coloured glasses here a little bit. [LAUGHTER]
JESS: I mean, the thing that I would love to come out of this is that all of these hosting sites make it easy for you to take that AI transcript that they are already generating, because clearly there’s something in their backend, the SEO probably, that is benefiting from it. It would be amazing if they made it possible for you to take that transcript out. Or maybe they introduce something, like, native to their own, I don’t even know how they would do this, but maybe there’s a way to do it. Rose-coloured glasses, think big here. If they could make it so that you could go in and edit your transcript so that it just looked the way that it was supposed to. You could add in things like sound labels. It maintained, like, the temporal pacing. It stayed as a VTT or SRT file, but it gave you kind of that, like, raw starting point for you to edit. That, to me, would get rid of a bunch of the hurdles that currently exist of just having to generate that for yourself in the first place.
And then if they’re already going to that effort, like maybe then they could make it easier for there to be a relationship of like, okay, great, like we want the creator to have a say in the way that it’s being presented. Like, I don’t know, I don’t know exactly what this would look like. Maybe it has to happen like further upstream when you are doing an upload. But I don’t know, like, I just think they could put up less roadblocks and they could make it more of like, a collaborative thing. Like this is one of the big things of accessibility is if you just start doing accessibility, thinking that you know better from like, an ivory tower perspective. Like, this is the problem that happened with YouTube and the craptions initially is they were like, okay, we can just do this. And then when the audience who was using them was like, actually this is worse. They actually did listen to them and they like did make it more collaborative.
And actually for a while, I don’t know if they do it anymore, I haven’t checked. But for a while, I don’t know if you remember this. On YouTube you could edit captions while you were watching something if it was incorrect. So it was sort of this like, pooled community effort of raising the value of captions. And like, actually, that is something that happens sometimes in podcasting. I mean, I’ve heard of it more frequently in sort of like, not for profit space. Like, if a not for profit has a podcast and then there’s someone who, like, volunteers with the not for profit and they really end up falling in love with their podcast, then sometimes it’ll be like free labour of them generating transcripts for them, for instance.
But I think that this could be like a cool pooled community thing. Like, even if we think about Wikipedia, like, you know, the Wikipedia of, you know, you’re in my early days where it was like, [LAUGHTER] you were. I remember being in college and people being like, you absolutely should not be using Wikipedia at all ever. For anything like that. You would want to consider academic…
MARY: Right, you can’t cite.
JESS: …you can’t cite Wikipedia and I wouldn’t cite Wikipedia now, just to be clear. But the amount of citation that goes into Wikipedia and the amount of, like, collaborative, like, knowledge that exists there is to me, like, one of the best parts of the Internet. So I sort of wonder if we could do something like that with transcripts where either, you know, Apple, Spotify, whoever, makes it easier for the creators to be able to at least edit existing AI transcripts and make them actually functional, or like, make it even broader than that and open it up to the community and make it sort of like the YouTube captions of it can be community edited. Make it like Wikipedia of like, oh, that wasn’t correct. This is the wrong apostrophe in this transcript.
Not that that really matters necessarily, but to me it does, as someone who really, really cares about grammar. And I don’t always catch it. Like, I did this for a living, and the threshold is 97% accuracy; it’s very difficult for one human being to get higher than that. But if you pool resources and have multiple 97%-accuracy human beings looking at it together, then you are going to get it to 100%, because those three percents are going to look very different for everybody in terms of what gets lost. So I think that, like, dream world, we just do communal transcript editing that is, like, based in the community.
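[Editor’s note: Jess’s pooling argument can be sketched with back-of-envelope math. If each reviewer independently misses a different 3% of the errors, which is roughly what she describes when she says those three percents look very different for everybody, the share of errors that survives n passes shrinks geometrically. A hypothetical illustration:]

```python
# Back-of-envelope model of pooled transcript review.
# Assumption: each reviewer catches 97% of errors, and different
# reviewers miss *different* errors (i.e., misses are independent).

def residual_error_rate(per_reviewer_accuracy: float, reviewers: int) -> float:
    """Fraction of errors expected to survive this many independent passes."""
    miss_rate = 1.0 - per_reviewer_accuracy
    return miss_rate ** reviewers

# One 97%-accurate reviewer leaves 3% of errors in place;
# three reviewers together leave only about 0.003%.
for n in (1, 2, 3):
    print(n, residual_error_rate(0.97, n))
```

[It never hits a literal 100%, but after a few reviewers the residual error rate is effectively zero, which is the practical point being made.]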
MARY: I love that, yeah.
JESS: I don’t know if it’s going to happen, but I’d love it.
MARY: That would be so great too, because I’m personally not an Apple user. So, like, I never go into Apple to look at the transcript, except, like, I do have an iPad for my daughter, so, like, I could look if I wanted to. But because they do their own transcripts, and then Spotify does their own transcript. Like, in my ideal world, I would love that community aspect. Or to be able to somehow get it from your RSS feed, where you can just change it once on your RSS feed and then it changes across the board for everybody, and they’re, you know, okay with that. Because as a creator, I don’t have time to go into Apple, and Spotify, and YouTube, and then change it each time, if I say my name too fast and it says Maryanne instead of Mary Chan, you know?
JESS: Yeah, I have the Jess Schmidt. Very difficult to say…
MARY: Yeah.
JESS: …transcriptions. Do not like it. I understand your pain.
[LAUGHTER]
MARY: Jess, this has been really enlightening. Like, there’s been a lot of times where I’m like, oh, my gosh, I did not know that. So, thank you for sharing your brain with us.
JESS: Oh, I’m so happy to. I love nerding out about transcripts and captions.
MARY: But, you know, we talked about the future and stuff. But what about, like, right now? As I close off: right now, we are recording in February 2026. What is exciting for you about podcasting? It doesn’t have to be transcripts.
JESS: I mean, to keep it in transcripts, I think the fact that people care about transcripts and want to talk about them is really cool. And like, I’ve always had transcripts just stored on my websites for my shows, especially because that helps, like, your actual website that you made for the show be indexable. This is kind of cheeky, but I’ve done it on some of my own shows, where I’ll say, you know, if you’re using Apple or Spotify right now and you’re, like, using the captions or the transcript and you’re upset that the quality isn’t good, I have a proofed transcript on the website. So that’s always another option, to circumvent and just, like, make it clear to people where to go. So I think it’s cool to, like, even just get to have these conversations.
I mean, what are the things that I think are interesting in podcasting right now? I am really anti capitalism, so,…
[LAUGHTER]
MARY: Yeah.
JESS: …like, getting to. As, you know, like, Mary, you and I talk about it. I think it’s, like, a really interesting time, especially in Canadian podcasting right now. I think that there’s, like, some really cool rumblings happening. And I just did an op-ed about, why is the metric for podcasting success sales?
MARY: I love that one. Yes. Yes.
JESS: Yes, thank you. Like, a lot of podcasts are not selling anything. Like, why would you use sales to measure your success? That makes no sense to me. And that was, like, exceptionally well received. And I had so many people be like, yes, thank you for saying that. I think about this all the time. So I don’t know if anything’s going to come of that. I mean, I’m a very, like, burn-it-down…
MARY: Yes. [LAUGHTER]
JESS: Let’s start something new and better. I don’t know if podcasting is, like, gonna be able to fix everything, but, like, I don’t know. I love podcasting. And if podcasting is the first one to be like, we need to think about the way we create and, like, value creativity differently. Like, I’m down for that.
MARY: Yeah. Hear, hear. [LAUGHTER]
JESS: Start a revolution is what…
MARY: Yes.
JESS: …I’m excited about podcasting right now.
MARY: Yeah. Because, like, podcasting, it’s the same as when I worked in radio. Radio is a business, you know; there’s a whole sales team, and they’re always trying to sell radio ad time. And it’s just such an intangible product that the good salespeople knew how to sell radio. But then I was like, we’re not selling. Most podcasters are not selling anything.
So how do we redefine what a podcaster is? For your own specific show, how do you redefine what success is? What does your podcast mean to you? That’s not just ROI and whatever number, percentage growth, download, blah, blah, blah. Because really, especially if you’re doing this as a passion project, it means nothing in the grand scheme of things. So, yeah, what can we do or show that is something that’s a bit more tangible, so that you can be like, yeah, I’m a successful podcaster?
JESS: I don’t have an answer to that.
MARY: No.
JESS: This can be your next episode if you get somebody to answer this question.
[LAUGHTER]
MARY: Well, I do have an episode called “The Intangible Values of a Podcast.” So I touch on it. But again, it’s intangible. It’s something for you to be like, okay, I want to grow my network. I want to have conversations with people and grow, you know, what have you, XYZ. And so there are things, but it’s on a more personal level versus industry, because industry is going to be driven on capitalism. Oh, dear. Okay.
JESS: Yep. Full circle. Full circle.
MARY: Grr.
JESS: But it is. But I think that your point, it’s so good. And, like, I love the one you did with Meg Wilcox here, too, of, like, you know, people think of it as, like, oh, it has to be this or this. Like, you pick this or this. Like, Apple is showing us, like, they’ve picked SEO, but they also do have some eye to accessibility, whether or not they’re executing it, TBD, but, like, they are trying to do both at the same time. And I think that this is what I think is important to think about, is that it generally makes content better. It makes people happier. When you are working on things that make you happy, like all those little intangibles,…
MARY: Yeah.
JESS: …it’s like, that’s immeasurable, but at the same time, it’s, like, vital.
MARY: Yes.
JESS: Like, you can’t do it without those pieces, and so trying to separate them out is really impossible. It’s just a matter of, like, how do we kind of figure out how to. To square both of these things together?
MARY: Yeah. Because nobody wants to be on a, uh, hamster wheel of, like, okay, I have to create a podcast episode now. I have to do it. And so there’s no joy in that.
JESS: No. I always am. Like, the amount of podcast emergencies. “Podcast emergency” is an oxymoron to me. You know, like, that shouldn’t exist. It’s supposed to be fun. You’re supposed to be having a good time. I’ve cried a lot while I make podcasts, so it’s not 100% of the time that it’s a good time, but most of the time, I’m like, wait, I’m supposed to be having a good time here? You know?
MARY: Yeah. And I think a good time can also equate to an emotional time. You want that emotional connection, whether it be the crying, the anger, the rage, or laughing a lot, which we have done.
JESS: Yeah.
MARY: So thanks, Jess, for coming on the show. I appreciate the laughter and the connection.
JESS: Hey, I always will talk to you about anything, so I’m happy I got to talk about something I’m really nerdy about. Thank you for having me.
[JOYFUL MUSIC IN]
MARY: I love nerding out. And this one is about transcripts and captions. I never thought I would ever nerd out about transcripts and captions, but here we go. Uh, definitely a good time. Thanks, Jess, for coming on the show. So after listening to that, what’s got you fired up right now? You know, right after I hit stop on that recording, I actually went down the little rabbit hole of craptions history, and I started researching more stuff. So if you want to read this great article from WIRED, it’s from back in 2019, about the problem with craptions. It’s a great little walk down memory lane for me, because once I read the article, I was like, oh, yeah, I do remember this happening. So I’ll put the link in the show notes, and you can go give that a read and add another layer to the mini history lesson on closed captioning on the YouTube side.
[MUSIC CONTINUES AND BUILDS]
Okay, so Jess and I really want to know: how are you creating transcripts? Or are you even creating your own transcripts? Are you thinking, oh, yeah, well, Apple and Spotify do it good enough, right? Like, there’s so many layers to this. And even when I asked Jess this, I was wondering, like, my process, can it be improved? And so what we do at Organized Sound Productions, my podcast production and consulting company (and it’s exactly what we do for my show, this podcast here), is that once the audio is finished, we plunk that audio into Edit Eddy. That’s an online platform that’s part of Headliner. They own that. And Headliner, if you don’t know, is an online program that helps create audiograms. And that’s what a lot of podcasters use Headliner for.
So the transcript gets spit out by Edit Eddy. We then copy and paste that text into a Google Doc. Here in the Google Doc, we actually have humans clean it up. So again, thank you, Kristalee. She does all of my transcripts, and then I run through it afterwards with my eyes too. So we try to get to that 97% accuracy that Jess is talking about. We do a lot of research when it comes to, like, technical names and companies, and even for industries that we’re not involved in, we make sure we try and get all the proper nouns correct, as Jess had mentioned. So we do that.
I’ve always been thinking about those audio descriptions. Like, I put in “music in” and “music out,” but really it needs more than that. And I’ve been thinking about it for a long time. So we’re going to do that as well for our transcripts now. That’s going to be something that we’ll integrate into our process, because Edit Eddy or whatever AI version you use won’t have that option. So we’re going to add all of those things into our Google Doc, clean it up. But this is where the timestamps get lost, right? If we edited this in Edit Eddy and then exported as an SRT, the timestamps would still be there. But it’s the whole user experience thing. And we find that if we actually edit in a Google Doc, it just feels more seamless versus having the Edit Eddy version of editing the doc.
So we do lose the timestamp part of it, or the temporal time locks that Jess mentions, because it is harder on the end user to collaborate efficiently and to really smoothly, quickly go through that transcript when it’s not on a Google Doc. So that’s one thing that we are missing, and it means Apple and Spotify won’t actually pick up our approved transcripts, our reviewed ones. But like Jess said, you know, if you do want a human-reviewed transcript, it is on my website for this episode and a whole bunch of episodes that we’ve done, at VisibleVoicePodcast.com, if you want to use those instead of the auto-generated Apple, Spotify, or even YouTube ones. All right? So there is that.
So I just really want to know: how are you creating transcripts? Because that’s our process here. But what process do you have going on, especially if you’re listening to this well into the future of when it was originally published, which is late February 2026? Are there any new tools to help with this, or are we still at the same stage as when this episode was first published, like nothing has really evolved since February 2026? I need to know if you’re listening to this, especially in the future. Well, not future for me, but present for you. Send me a voice note from my website, VisibleVoicePodcast.com, or email is always lovely as well: VisibleVoicePodcast@gmail.com. So send me your feedback. I’d love to know.
[MUSIC CONTINUES]
On the next episode: what’s the difference between an intro and a trailer? How often do you hit play on a podcast episode and the intro seems like it’s going on forever, telling you everything about the show before we even get to the episode itself? I’ve heard this many, many times on shows I’ve been exploring, right? Like, the organic search. Ooh, is this one interesting? I’m gonna hit play. Well, I hit stop a lot and move on, because dang, if I become a superfan, that’s a lot of boring, repetitive intros I have to fast-forward through before I get to the good stuff.
So I want to help you out with that part of your show. Let’s figure out how to create a compelling intro and know the difference between a trailer and an intro. Do you need one? Do you need both? Do you need clips? What sort of intros are common these days? And more importantly, which one, which format and ideas, will actually work for you and your show? I’ll show you how and what works on the next episode.
[JOYFUL MUSIC ENDS]
<< OUTRO – SHOW CLOSE // UPBEAT THEME MUSIC IN – GHOSTHOOD FEATURING SARA AZRIEL “LET’S GO” BEGINS >>
MARY: Thank you so much for listening to The Podcaster’s Guide to a Visible Voice. If you enjoyed this episode, I’d love it if you shared it with a podcasting friend. And to reveal more voicing and podcasting tips, click on over to VisibleVoicePodcast.com. Until next time.
<< WOMAN SINGS: Let’s go >>
<< MUSIC ENDS >>