Automating AI-Powered Speech Recognition and Sentiment Analysis

This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Summer Term 2020).


In this project, the speech recognition accuracy of the Google Cloud Speech-to-Text API and of Descript was determined and compared. The accuracy was measured using the word error rate (WER) and the speaker identification rate. To automate the analysis, a pipeline was built that takes the transcripts and the audio files as input and outputs the resulting accuracy values. Additionally, a sentiment analysis was integrated into the pipeline.


Transcribing meetings, interviews or speeches is time-consuming. However, over the last couple of years there has been a surge in AI-based speech recognition solutions. Most of the available tools are offered by tech giants such as Amazon, Microsoft and Google, and their accuracy is far superior to that of a solution that could be developed independently. Therefore, in this project two existing tools were used for speech recognition and compared with each other.

Google Cloud Speech-to-Text API was chosen because Google has massive audio content repositories, giving it access to a huge amount of training audio for building a speech model. Descript, in contrast, is an app that automatically transcribes audio files. It uses a third-party transcription engine, in this case the Google Cloud Speech-to-Text API: Descript uploads the files to Google Cloud Storage, from where they are fed into the Speech-to-Text API. Even though Descript uses the Google API, there are big differences in pricing; the plans offered by Descript are less expensive than the Google API. Therefore, within this project the Google Cloud Speech-to-Text API was compared with Descript to find out where this pricing difference comes from.


A self-written dialogue with three speakers was prepared, which also served as the “ground truth” for the accuracy measurement. This dialogue was read by the participants, and the audio files were recorded under different conditions. The varied parameters were the number of speakers (one to three) and the background sounds. For each number of speakers, one audio file was recorded without background noise and one each with classical music, music with lyrics, and traffic noise.

To automate the analysis of the transcripts, a pipeline was built, which takes the generated transcripts as input and gives the analysis results as output.

Generating the transcripts
After building the audio dataset, the audio files were manually uploaded to Google Cloud Speech-to-Text API and Descript for generating the transcripts.

Google offers the option to choose between different machine learning models depending on the application. Further advantages include support for different languages, recognition of multiple speakers, and punctuation. Moreover, it is possible to enter the number of speakers in the audio file to achieve better results. Compared to Google, Descript only offers automatic transcription or the “White Glove” feature, a human-powered transcription at additional cost. Since this work deals with machine-generated transcripts, the automatic transcription was chosen. However, no information regarding the applied models was given. Descript also provides the option to automatically detect multiple speakers if desired.
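As a rough sketch, passing the number of speakers to the Speech-to-Text API corresponds to enabling speaker diarization in the request body. The following builds such a request for the REST `longrunningrecognize` endpoint; the bucket URI and speaker counts are placeholders, not the project's actual values:

```python
# Sketch of a Speech-to-Text request body with speaker diarization enabled.
# The gs:// URI and speaker counts are illustrative placeholders.

def build_recognition_request(audio_uri, min_speakers, max_speakers):
    """Build the JSON body for a speech:longrunningrecognize REST call."""
    return {
        "config": {
            "languageCode": "en-US",
            # Hinting the expected number of speakers improves diarization.
            "diarizationConfig": {
                "enableSpeakerDiarization": True,
                "minSpeakerCount": min_speakers,
                "maxSpeakerCount": max_speakers,
            },
            "enableAutomaticPunctuation": True,
        },
        "audio": {"uri": audio_uri},
    }

request = build_recognition_request("gs://example-bucket/dialogue.flac", 1, 3)
```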

All generated transcripts were adjusted to the ground truth in terms of number of lines to simplify the further analysis.


Measuring Accuracy: WER and Speaker Identification
The transcription accuracy was determined by calculating the word error rate (WER) and the speaker identification rate.

The WER was determined with the Python package JiWER, which calculates the difference between the ground truth sentence and the hypothesis sentence. The WER of two identical sentences is 0, whereas the WER for completely different sentences is 1.
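Conceptually, JiWER computes a word-level edit distance between the two sentences. A minimal hand-rolled sketch of the same idea (not the project's actual code, which used JiWER directly) looks like this:

```python
def word_error_rate(truth, hypothesis):
    """Word-level Levenshtein distance divided by the number of truth words."""
    t, h = truth.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i truth words
    # into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if t[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(t)][len(h)] / len(t)
```

For identical sentences this returns 0, and for two equally long sentences with no words in common it returns 1.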

Some pre-processing steps had to be applied first:

- All characters were converted into lowercase.

- Multiple spaces between words were filtered out.

- Punctuation characters were removed.

- Common English contractions were expanded, e.g. “let’s” to “let us”.
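The pre-processing steps above can be sketched as a single normalization function; the contraction list here is an illustrative subset, not the project's full list:

```python
import re
import string

# Illustrative subset of contraction replacements; both straight and
# curly apostrophes are covered, since transcripts may contain either.
CONTRACTIONS = {"let's": "let us", "let\u2019s": "let us",
                "can't": "cannot", "won't": "will not"}

def preprocess(text):
    """Normalize a transcript line before the WER calculation."""
    text = text.lower()                          # convert to lowercase
    for short, full in CONTRACTIONS.items():     # expand contractions
        text = text.replace(short, full)
    # remove punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # collapse multiple spaces into one
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

Contractions are expanded before punctuation is stripped, because the replacement keys still contain apostrophes.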

The WER method was successfully integrated into the pipeline.

The second method used to measure the performance of the Google API and Descript was the speaker identification rate. The wording of the transcripts was irrelevant for this method; it was only evaluated whether the speakers were correctly identified. For this purpose, a dedicated function was created. Every speaker label in the ground truth transcript was compared with the corresponding speaker label in the transcripts obtained from the Google API or Descript. A straightforward comparison was possible because all transcripts had the same number of lines. The output was analogous to the WER: 0 means all speakers were identified correctly, while in the worst case the speaker identification rate is 1.
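A minimal sketch of such a line-by-line label comparison might look as follows (not the project's exact function; the speaker labels are assumed to be simple strings):

```python
def speaker_identification_rate(truth_labels, predicted_labels):
    """Fraction of lines whose speaker label does not match the ground truth.

    0 means every speaker was identified correctly; 1 means none were.
    Assumes both transcripts have the same number of lines.
    """
    if len(truth_labels) != len(predicted_labels):
        raise ValueError("transcripts must have the same number of lines")
    errors = sum(t != p for t, p in zip(truth_labels, predicted_labels))
    return errors / len(truth_labels)
```

This direct equality check only works because the transcripts were adjusted to share the ground truth's line count and labeling beforehand.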

Sentiment Analysis
The next milestone was the sentiment analysis, which was implemented as an additional feature of the pipeline. The main idea was to take audio files with recorded speech or dialogue as input and to output the sentiment described with the words “positive”, “neutral” or “negative”.

For this purpose, the Google Natural Language API was chosen. It turned out that this API can only evaluate the sentiment of text, not of audio files. Therefore, speech was first converted into text with the Google Speech-to-Text API, and the resulting text was analyzed with the Google Natural Language API. Apart from the speech analysis, a text analysis feature was added, which can be used to estimate the sentiment of reviews, responses and other text files.
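The Natural Language API returns a numeric document sentiment score between -1.0 (negative) and 1.0 (positive), so mapping it to the three word labels requires a threshold. A sketch of such a mapping, where the ±0.25 cut-off is our illustrative assumption rather than a value prescribed by the API:

```python
def score_to_label(score, threshold=0.25):
    """Map a document sentiment score in [-1.0, 1.0] to a word label.

    The 0.25 threshold is an assumed cut-off for illustration;
    tuning it changes how often "neutral" is reported.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```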


Speech Recognition Analysis

Figure 1 shows the calculated WERs for the transcripts generated with the Google Cloud Speech-to-Text API and Descript. The analysis was conducted with transcripts with varying numbers of speakers and different background noises.

The WER analysis reveals that the Google Cloud Speech-to-Text API outperforms Descript in terms of the WER, which is surprising, since Descript uses the Google engine for its analysis. Furthermore, the accuracy of both tools decreased when background sounds were introduced, which matches our expectation, since background noise can interfere with the speech. The highest WER was expected for the measurement with music with lyrics. However, the analysis reveals that traffic noise also has a great impact on the accuracy, since the resulting errors are generally of the same order of magnitude as those for music with lyrics.

Descript shows a stronger dependence on the background noise than Google, since its WER increases more with increasing noise than Google's. Another finding is that the Google API seems to be independent of the number of speakers, since for each condition (background noise) no significant changes in accuracy were observed. However, the measurement with the traffic background noise is an exception and might be an outlier.

However, these results do not fully reflect the real accuracy, since the WER method weights all words equally, even though they do not contribute equally to the understanding of the speech.

Speaker Identification

After analyzing the accuracy in terms of the WER, the speaker identification was determined. The results of the analysis are depicted in Figure 2.

As can be seen, the Google API again outperforms Descript, except for the measurement without background noise. This might be traced back to measurement errors and the small number of conducted measurements. The analysis also reveals that, in general, the speaker identification error increases with the number of speakers.

Sentiment Analysis

The sentiments of all recorded audio files were analyzed. The majority of the results were “positive” and “neutral”, and only one audio file was rated “negative”. The reason for this inconsistency might be the way the Google API was applied: the sentiment analysis was based only on the text. We found it unexpected that Google does not offer a feature to analyze audio files directly. Intonation, timbre and emotion carry sentiment information beyond the text alone, and in our application this information is lost.

The successful integration of the sentiment analysis functions is another important achievement. The pipeline works for any number of audio files, and the results can easily be saved.


A significant amount of time was spent setting up the Google APIs. The first step was to create an account, and some difficulties occurred here: only certain credit cards were accepted. Another problem was authentication. There are various ways to authenticate calls to Google APIs; in our project, API keys were used. A unique API key associates API requests with the application. Because the pipeline was based on several Google APIs, all of the above issues had to be taken into consideration and the respective documentation studied.

Another challenge was the construction of the pipeline. The idea was to create a complex pipeline that functions independently of the number of audio files and transcripts. The implementation of all functions and their interaction was an important aspect that was discussed several times in our meetings.

One difficulty faced in this work was the transcript generation, since it was conducted manually. Furthermore, the number of detected speakers was not always consistent with the expected number of speakers. As a result, some generated transcripts did not have the same number of lines as the ground truth. Since it is more difficult to compare files with different numbers of lines, these transcripts were adjusted according to our own judgement, which might have affected the results for the speaker identification.

The inability of the Google APIs to analyze the sentiment of audio files was totally unexpected. This problem was solved by combining two different APIs. Nevertheless, a large part of the information, such as intonation and emotion, was lost.

The code was edited with Visual Studio Code and all changes were uploaded using Git. Both tools required a complicated setup, and the more members a project has, the more difficult it is to fix all setup errors. On the other hand, uploading changes was easy during the whole project. In our opinion, it was the correct decision, and we would recommend it to other groups. This was our biggest learning during this project work.

Further work

Currently our application exists only as Python code. Future work could be the development of the pipeline as an app: an audio file could be used as input and the analysis could run fully automatically.

Another idea for further work is the improvement of the sentiment analysis function. As mentioned above, the Google APIs are not able to evaluate audio files. Alternative tools for estimating the sentiment of a speech could be tested and analyzed.

TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact us so that the corresponding item can be removed.
