MAXQDA Research Blog
April 2019
How to Use AI Transcription Services with MAXQDA
Matthew Loxton
Abstract: The advent of machine learning represents a dramatic change in the effectiveness and efficiency
of voice-to-text software. The same machine-learning technology can be used to transcribe interviews. I
reviewed 3 AI transcription services and used one of them throughout a current MAXQDA 2018 project.
Keywords: AI, Automatic speaker change, Computer Assisted Qualitative Analysis, functions, interview
data, Machine Learning Technology, Multimedia Browser, transcribe interviews, voice-to-text
Table of Contents
1. Introduction
2. How MAXQDA’s Transcription Mode Feature Can Be Used in Conjunction with Machine Learning
Voice-to-Text Services
3. Recording and importing the interviews
4. Method 1: Microsoft Word
5. Method 2: SRT File
6. Conclusion
Editor’s Note
1. Introduction
For some people, myself included, MAXQDA’s Transcription Mode tools are a wonderful help. The software cannot, however, make us efficient at transcribing audio files. Even with the many functions available in MAXQDA’s Multimedia Browser, such as keyboard shortcuts and automatic speaker changes, I am still far too clumsy with F4, the pause button, and the various other controls to transcribe audio efficiently.
My typing is too slow, my working memory too short, and I get confused between the function keys and buttons. As a result, transcribing a 30-minute audio file takes me several hours, and by the end, I am exhausted.
2. How MAXQDA’s Transcription Mode Feature Can Be Used in Conjunction
with Machine Learning Voice-to-Text Services
The advent of machine learning represents a dramatic change in the effectiveness and efficiency
of voice-to-text software. Where voice-activated systems and voice-to-text applications were previously stuck at low levels of accuracy and speed, machine learning has produced a variety of devices and services that boast accuracy of 95% and higher. So much so that Siri, Alexa, Cortana, and Google Home are now part of daily life for millions of people. Google Duplex takes this further and is almost indistinguishable from a human assistant.
Transcribing interviews
The same machine-learning technology can be used to transcribe interviews. I reviewed three
different options and services, ultimately using one of them throughout a current project.
My aim was to speed up my transcription process and to remove the frustration that normally accompanies this step in the research process for me. I recorded nine interviews, which ranged from 18 to 48 minutes, and each of which involved one participant, the primary interviewer, and myself.
Three voice-to-text services
Based on some research on popular services, I looked at three options: the automated machine transcription services TEMI and SPEXT, and a third option that combined machine transcription with human review.
I found that approximately 1-2% of the participant text in the two machine transcriptions needed edits, typically for similar-sounding words such as “IV” vs. “IB” or “wake” vs. “wait”. I also needed to change the speaker names manually. Error rates were higher in the interviewer text, but this was not significant for my needs.
The low cost and almost immediate turnaround of the machine transcriptions meant that for just over $20 in total, I could have very workable transcriptions, almost ready to code, within minutes for all nine of the roughly half-hour interviews.
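As a rough back-of-envelope check: nine roughly half-hour interviews come to about 270 minutes of audio, so just over $20 works out to somewhere around 7-8 cents per minute of recording.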
I found that making the few edits required was far less frustrating and time-consuming than
manual transcription had been in the past. This is obviously a function of my typing speed and
keyboard dexterity, but may be similar for many other researchers.
3. Recording and importing the interviews
I used Skype for Business to create the meeting requests for the interviews, provide Voice over IP
(VoIP) services and dial-in numbers, and allow me to record the call as an MP4 file with good
audio quality. I experimented with two alternative processes:
In the first, I used the content of an MS Word document from the transcription provider; in the second, I used MAXQDA’s Import Transcripts with Timestamps function to import a SubRip (SRT) file with timestamps from the provider.
My transcription environment, captured in Figure 1, shows text from the machine learning service pasted into MAXQDA’s “Document Browser” window for the audio file selected in the “Document System” window. At this point, I am about to edit the name of the second speaker using the Automatic Speaker Change and Autotext functions.
4. Method 1: Microsoft Word
My process using the MS Word file was as follows:
1. Download the transcription from the online service as an MS Word document,
2. Import the MP4 into MAXQDA,
3. Open the file for transcription,
4. Paste the content of the Word file into the Transcription Mode “Document Browser”
window,
5. Run the audio in MAXQDA’s Multimedia Browser, and make any edits in the text as
needed,
6. Add timestamps at each speaker change or where desired.
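Because the speaker names always need manual correction, one small optimization is to rename the service’s generic labels before pasting the text into MAXQDA. Below is a minimal Python sketch of that idea, assuming the python-docx package and that the service labels speakers as “Speaker 1” and “Speaker 2”; the file names and label mapping are hypothetical:

# rename_speakers.py: a hypothetical helper, not part of MAXQDA or any service.
# Reads the service's .docx transcript, swaps the generic speaker labels for
# real names, and writes plain text ready to paste into Transcription Mode.
from docx import Document  # pip install python-docx

SPEAKERS = {"Speaker 1:": "Interviewer:", "Speaker 2:": "Participant:"}  # assumed labels

doc = Document("interview_01.docx")  # hypothetical file name
lines = []
for paragraph in doc.paragraphs:
    text = paragraph.text
    for label, name in SPEAKERS.items():
        text = text.replace(label, name)
    lines.append(text)

with open("interview_01.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(lines))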
Method 1 Findings
The method was effective, and I will continue to use this in production because it has greatly
reduced the time and effort to transcribe audio tracks at very low cost. There were some caveats,
however.
Speaker changes
Firstly, the machine transcription did a reasonably good job of detecting speaker changes, but it sometimes got confused, marking a change of speaker when in reality the same person was continuing to speak (Figure 2).
This was not a major hurdle: the changeovers were reasonably obvious, and mistakes were easily corrected. Sometimes the transcription switched to the new speaker a little late, several words into the next speaker’s text. In Figure 3, the highlighted text belongs to Speaker 1, not to the previous speaker. This too was reasonably easy to spot and edit.
Multiple speakers
Secondly, it did not like multiple people talking at the same time. When there were interleaved or simultaneous speakers, it tended to treat them as a single speaker, and sometimes lost several words from each.
This issue can be seen in the first paragraph in Figure 1 (above) where the participant and
interviewer start the interview talking about whether the VoIP system had announced that it
was recording. Both speakers were captured as a single person, and their phrases intermingled.
This was harder to identify and edit, but it tended to happen at specific points in the interview, such as transitions in topic. There were at least one or two occurrences in each interview, however, especially when something startling or funny was said and multiple people interjected.
For example, when one participant commented that they had been left alone in the MRI room
and staff had left for lunch, both interviewers interjected with comments, laughter, whistles,
etc. The machine transcription missed some of the words and mingled them together in small phrases. This was less easy to edit, but no worse than manual transcription would have been for me.
Lower volumes
Lastly, it often ignored audio that was significantly lower volume than the rest (it may have
calculated an average volume and then filtered out quieter sounds as noise). One recurring
situation brought this to light. At the end of each participant track, the interviewer typically
thanked them in a much quieter voice and then used normal volume to initiate the next question
or comment.
In many cases, these “asides” were not transcribed at all. This was not a problem in my situation, but it could be a significant issue if a participant’s volume varies.
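Since I could not change how the service filtered audio, one practical countermeasure is to scan the recording for unusually quiet stretches and listen to just those spots when checking the transcript. Below is a minimal Python sketch, assuming the pydub package (which requires ffmpeg) and a hypothetical file name; the 12 dB threshold is a guess to tune by ear:

# find_quiet_spans.py: hypothetical helper for locating quiet audio that a
# transcription service may have filtered out as background noise.
from pydub import AudioSegment  # pip install pydub; requires ffmpeg on the PATH

audio = AudioSegment.from_file("interview_01.mp4", format="mp4")  # hypothetical file
WINDOW_MS = 1000                  # inspect the recording one second at a time
threshold = audio.dBFS - 12       # "quiet" = 12 dB below the overall average (assumed)

for start in range(0, len(audio), WINDOW_MS):
    chunk = audio[start:start + WINDOW_MS]
    if chunk.dBFS < threshold:    # dBFS is -inf for silence, so gaps are flagged too
        minutes, seconds = divmod(start // 1000, 60)
        print(f"Quiet audio around {minutes}:{seconds:02d}")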
5. Method 2: SRT File
My method for the SRT file used MAXQDA’s Import Transcripts with Timestamps function:
1. Download the SRT file,
2. Import the SRT into MAXQDA using the Import Transcripts with Timestamps
function,
3. Point to the associated MP4 file in the resulting dialogue box,
4. Open the file for transcription,
5. Run the audio in MAXQDA’s Multimedia Browser,
6. Make edits in the text as needed.
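For reference, an SRT file is plain text organized as numbered cues, each with a time range on its own line followed by a short fragment of text. A made-up example of the structure (the speaker prefixes are an assumption; services vary in how, and whether, they label speakers):

1
00:00:01,200 --> 00:00:04,700
Speaker 1: Thanks for joining us today.

2
00:00:04,900 --> 00:00:08,300
Speaker 2: Happy to be here.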
Method 2 Findings
The accuracy rate was the same, and the same cost and logistics considerations applied. The
caveats were also similar, and although importing the SRT had an advantage in not requiring
me to add timestamps manually, it resulted in a highly fragmented text layout inherited from
the SRT file structure.
While the audio and text tracked well together, viewing the text as a vertical list of fragments was hard to read and made coding significantly more difficult. As a result, I did not continue with the SRT file method after the initial test.
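Had I pursued this method, one workaround would have been to merge runs of consecutive cues into larger blocks before import, trading timestamp granularity for readability. Below is a minimal Python sketch of that idea; the file names and group size are hypothetical, and the output is still a valid SRT file for the Import Transcripts with Timestamps function:

# merge_srt.py: hypothetical pre-processing to reduce the fragmentation of
# service-generated SRT files before importing them into MAXQDA.
import re

# One SRT cue: index, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then the text.
CUE = re.compile(
    r"\d+\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)
GROUP = 5  # merge every five cues into one block (assumed; tune to taste)

with open("interview_01.srt", encoding="utf-8") as f:  # hypothetical file name
    cues = CUE.findall(f.read())

merged = []
for i in range(0, len(cues), GROUP):
    group = cues[i:i + GROUP]
    start, end = group[0][0], group[-1][1]              # span of the whole run
    text = " ".join(t.replace("\n", " ") for _, _, t in group)
    merged.append(f"{len(merged) + 1}\n{start} --> {end}\n{text}")

with open("interview_01_merged.srt", "w", encoding="utf-8") as out:
    out.write("\n\n".join(merged) + "\n")

Whether the coarser timestamps are acceptable will depend on how closely the text needs to track the audio during coding.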
6. Conclusion
The low cost and short turnaround time make machine-learning transcription services worthwhile for researchers whose limited keyboard dexterity makes manual transcription tedious. The services do not offer perfect transcription, however, and each method of using them comes with caveats.
I found the MS Word import preferable to the SRT method: what it lost in comprehensive timestamping, it gained in the readability of the resulting text.
My overall finding was that both TEMI and SPEXT were good enough to continue using, but that if high fidelity were needed, and cost and time were less of an issue, the human-machine combination would be an attractive option.
Editor’s Note
Matthew Loxton is a Principal Analyst at Whitney, Bradley, and Brown Inc. focused on healthcare
improvement, serves on the board of directors of the Blue Faery Liver Cancer Association, and
holds a master’s degree in KM from the University of Canberra. Matthew is the founder of the
Monitoring & Evaluation, Quality Assurance, and Process Improvement (MEQAPI) organization,
and regularly blogs for Physician’s Weekly. Matthew is active on social media related to healthcare
improvement and hosts the weekly #MEQAPI chat.