By Sofia Rodriguez
With a simple “Hey, Siri,” you can have a full conversation with your phone. It feels like your phone understands what you’re saying and knows exactly how to respond. But how does your computer understand what you’re saying? And what are the greater cultural implications of technology like this?
Previous Methods of Transcription
In the past, transcription was a career in and of itself. For as long as spoken and written language have existed, there has been a need to keep a record of important testimony. Before audio recording devices were widely used, the only way to document these events was to write them down. Advances in software, however, have since given rise to automated speech transcription technologies that are now in wide use.
The Use of AI for Transcription
Because computers cannot listen to and interpret language the way the human brain can, we have to program them to “understand” certain aspects of a recording. This is done through artificial intelligence, a term coined by John McCarthy in 1956 meaning “human intelligence exhibited by machines.” Artificial intelligence was first used to analyze and quickly compute data, but it now has far greater capabilities.
The portion of AI used in speech transcription is deep learning, a subfield of machine learning. Machine learning is itself a subfield of artificial intelligence that allows computers to “learn” using complex algorithms: computers are taught to recognize patterns rather than follow a fixed set of rules to reach the desired result. Machine learning first requires vast amounts of example data, each paired with the result it should produce. Once the system can identify patterns in that data and make specific connections, raw data can be input and a result is produced according to the pathways it has already built. Deep learning is a subset of machine learning in which the computer builds many layers of connections from extremely large sets of data and continues to correct itself as more data is input. Computers are able to take in this much data because advances in hardware and software allow for improved data compression, greater storage capacity, and faster processing.
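To make the idea of learning from examples concrete, here is a minimal sketch of a perceptron, one of the simplest learning algorithms. It is an illustration only, not how a production speech system is trained: the two numeric features per sample and the labels are made-up values, and the names train_perceptron and predict are hypothetical. Nobody writes a rule for telling the two groups apart; the program adjusts its weights each time it guesses wrong.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Learn weights from (features, label) pairs by correcting mistakes."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)  # -1, 0, or 1
            # The adjustment is zero whenever the guess was already right.
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Made-up training data: two features per sample, label 1 or 0.
EXAMPLES = [
    ((0.9, 0.2), 1),
    ((0.8, 0.1), 1),
    ((0.2, 0.9), 0),
    ((0.1, 0.7), 0),
]

weights, bias = train_perceptron(EXAMPLES)
print(predict(weights, bias, (0.85, 0.15)))  # a sample it has never seen
print(predict(weights, bias, (0.15, 0.80)))  # another unseen sample
```

The last two lines feed the trained model samples that were not in the training data, which is the point of learning a pattern instead of memorizing a rule.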
How It Works
Deep learning uses artificial neural networks to create connections between inputs and outputs, with the intent of achieving a specific output, loosely similar to how a human brain analyzes information. The computer can then organize and access data in different sections or under different labels, which makes the connections clearer and the results more accurate as the data intake increases. Although we as humans may know the inputs and outputs of the neural network, we cannot easily see which connections the computer makes in between or how it makes them. The layers that sit between the input and the output are called hidden layers.
The structure of an artificial neural network (credit: Data Flair)
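The structure in the figure can be sketched in a few lines of code. The example below passes two input values through one hidden layer of three neurons and on to a single output neuron. It is a toy under stated assumptions: the weights and biases are arbitrary hypothetical numbers, not values learned from data, and the function names are made up for this illustration. A real network would learn these numbers during training and would have far more neurons.

```python
import math


def sigmoid(value):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-value))


def layer(inputs, weights, biases):
    """Compute one layer: each neuron takes a weighted sum of all inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Pass inputs through one hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    output = layer(hidden, output_weights, output_biases)
    return hidden, output


# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
HIDDEN_WEIGHTS = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
HIDDEN_BIASES = [0.0, 0.1, -0.1]
OUTPUT_WEIGHTS = [[0.7, -0.2, 0.4]]
OUTPUT_BIASES = [0.05]

hidden, output = forward(
    [1.0, 0.5], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIASES
)
print("hidden layer:", hidden)
print("output layer:", output)
```

The values printed for the hidden layer are the part a user of the system never normally sees: only the inputs and the final output are visible from the outside.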
The way AI analyzes and interprets language is broken into two different systems: ASR and NLP. ASR, or automatic speech recognition, is the ability of the computer to convert spoken words into text. It uses artificial neural networks to associate different sounds with written letters and words. The neural network takes in many recordings, each with a known transcript, which allows it to associate specific pitches or differences in pitch with certain words. This is what allows it to recognize what different voices are saying. Small corrections are then made so that the words match dictionary-defined vocabulary. NLP, or natural language processing, is the process of deriving meaning from the text. It allows strings of letters to be interpreted as meaningful sentences. It can also correct the ASR output so that the wording makes more sense in context. However, because humans constantly speak in abbreviations, colloquialisms, and acronyms, this neural network has to be updated often to keep up with the evolution of language. If it is not updated, the NLP may reach unreliable conclusions because it will not be able to define certain words.
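The correction step, where recognized words are adjusted to match a known vocabulary, can be illustrated with Python's standard difflib module. This is a simplified stand-in and not how any particular product works: it compares spellings only, whereas real systems weigh sound and context as well. The vocabulary, the sample words, and the function name snap_to_vocabulary are all hypothetical.

```python
import difflib


def snap_to_vocabulary(words, vocabulary, cutoff=0.8):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for word in words:
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is similar.
        corrected.append(matches[0] if matches else word)
    return corrected


# Hypothetical vocabulary and raw recognizer output.
VOCABULARY = ["speech", "recognition", "transcription", "is", "hard"]
raw_words = ["speach", "recogniton", "is", "hard"]
print(" ".join(snap_to_vocabulary(raw_words, VOCABULARY)))
```

A word with no close match is left untouched, which mirrors the problem described above: a new abbreviation or slang term that is missing from the vocabulary cannot be resolved until the vocabulary is updated.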
AI transcription and translation can reach an accuracy rate of up to 95%. However, the average accuracy is nowhere near as high because of the technology's many limitations. Because AI reads language based on differences in pitch, it struggles with low-quality recording equipment, background noise, and thick or hard-to-identify accents. For the same reasons, AI has trouble telling individual speakers apart. Machine learning has also not been perfected because language is hard to interpret from audio alone. As humans, we interpret much of language by reading body language and facial expressions, lip reading, understanding accents, and recognizing culturally accepted mispronunciations (e.g., “dunno” as opposed to “I don’t know”). Machines do not understand emphasis, emotion, and inflection in the way humans do. However, understanding these minute details is the key to truly intelligent AI.
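One common way to put a number on transcription accuracy is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the machine's transcript into the correct one, divided by the number of words in the correct transcript. Accuracy is then often quoted as 1 minus the WER. The sources for this article do not say how the 95% figure was measured, so the sketch below shows one plausible measure, not the method behind that number. The function name and sample sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # reference word missing from hypothesis
                    current[j - 1] + 1,      # extra word in hypothesis
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


reference = "i do not know what you mean"
hypothesis = "i dunno what you mean"
wer = word_error_rate(reference, hypothesis)
print(f"word error rate: {wer:.2f}, accuracy: {1 - wer:.2f}")
```

Note that a measure like this treats “dunno” as a plain error, even though a human listener would accept it as a faithful rendering of what was said.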
The Future of Transcription
As the world becomes increasingly digitized, we as a society become increasingly dependent on technology. Current AI-based transcription and translation is not accurate enough to replace human transcription and translation, but as the technology advances and the pool of training data is diversified, its accuracy will increase. Because there is no way to guarantee 100% accuracy, the role of the transcriber will not disappear altogether. Any automated transcription would still need to be checked by a person, which would become the new role of transcribers and would limit the economic disruption caused by lost work. Furthermore, as digitized transcription becomes more accessible, demand for digitized transcription and translation tools would likely increase. This, in turn, would boost the profits of the companies behind leading tools such as Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, Samsung’s Bixby, Trint, Sonix, Happy Scribe, Descript, and Otter.ai.
Finney, Jennifer. "AI vs Human Transcription Accuracy for Speech-to-Text Services." Rev, 8 Aug. 2019, www.rev.com/blog/ai-vs-human-transcription-accuracy. Accessed 10 Mar. 2020.
Hopwood, Sean Patrick. "AI for Voice Transcription: Is It Here to Last?" AIthority, Dec. 2019, www.aithority.com/guest-authors/ai-for-voice-transcription-is-it-here-to-last/. Accessed 10 Mar. 2020.
"How Does Automated Transcription Work?" Trint, blog.trint.com/how-does-automated-transcription-work. Accessed 11 Mar. 2020.
"The Limitations of Machine Learning for Transcription." McGowan Transcribe + Translate, www.mcgowantranscriptions.co.uk/machine-learning-ai-transcription/. Accessed 10 Mar. 2020.
Markoff, John. "From Your Mouth to Your Screen, Transcribing Takes the Next Step." The New York Times [New York City], 2 Oct. 2019, www.nytimes.com/2019/10/02/technology/automatic-speech-transcription-ai.html. Accessed 10 Mar. 2020.
Myers, Erin. "The Role of Artificial Intelligence and Machine Learning in Speech Recognition." Rev, 25 Aug. 2019, www.rev.com/blog/artificial-intelligence-machine-learning-speech-recognition. Accessed 11 Mar. 2020.
Noone, Greg. "When AI Can Transcribe Everything." The Atlantic [London], 20 June 2017, www.theatlantic.com/technology/archive/2017/06/automated-transcription/530973/. Accessed 10 Mar. 2020.