How to Build a Dataset for PyTorch Speech Recognition

Mohamed Dyfan
6 min read · Aug 21, 2020


You have probably never stopped to think about how complex communication is. Thousands of factors are involved. We can read emotions through words, body language, or tone, and AI models are still far from beating us at these tasks.

The purpose of this article is to show you how I built a dataset that teaches a model to classify which tone a person is using, specifically in Mandarin Chinese.

Mandarin Chinese

Mandarin Chinese is a tonal language, which means the tone you use can change the meaning of a word. There are five distinct tones, and most words contain more than one. Mandarin speakers have to classify combinations of tones in seconds to work out what the speaker is saying; this is part of what makes Mandarin Chinese one of the hardest languages to learn.

Multitask Learning

Multitask learning allows you to train a model to perform multiple tasks; self-driving cars use this method to recognise many objects. It works by sharing information in the lower layers of the model to classify each y label, rather than building a model for each task.

I used multitask learning to build a model that predicts the combination of tones in a word. Because the lower layers are shared, the model can predict consecutive tones more quickly and behaves more like a native Mandarin speaker.
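To make the idea of shared lower layers concrete, here is a minimal PyTorch sketch of a two-headed network. It is not the model from this project: the name `TwoToneNet`, the layer sizes, the 64x64 input, and the choice of six classes per tone are all assumptions made for illustration.

```python
import torch
from torch import nn


class TwoToneNet(nn.Module):
    """Hypothetical model: shared convolutional trunk, one head per tone."""

    def __init__(self, n_classes: int = 6):  # six classes is an assumption
        super().__init__()
        # Lower layers are shared by both tasks.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One small classifier per y label (first tone, second tone).
        self.head_first = nn.Linear(32, n_classes)
        self.head_second = nn.Linear(32, n_classes)

    def forward(self, x):
        features = self.trunk(x)
        return self.head_first(features), self.head_second(features)


model = TwoToneNet()
batch = torch.randn(4, 1, 64, 64)      # 4 fake single-channel spectrograms
tone1 = torch.randint(0, 6, (4,))      # fake labels for the first tone
tone2 = torch.randint(0, 6, (4,))      # fake labels for the second tone

logits1, logits2 = model(batch)
# One common choice: the multitask loss is the sum of the per-task losses.
loss = (nn.functional.cross_entropy(logits1, tone1)
        + nn.functional.cross_entropy(logits2, tone2))
loss.backward()
```

Both heads read the same feature vector, so gradients from each tone update the shared trunk. Summing the two losses is only one option; weighting them differently is another.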

Audio Classification

Most sound recognition projects convert audio into spectrograms, visual representations of how the frequency content of a sound changes over time; advances in computer vision influenced this approach. Spectrograms provide higher-dimensional information about the audio data, but they have a weakness: humans are better at detecting differences at lower frequencies, and a standard spectrogram disregards this fact.

Use the following website to see whether you can notice the difference between a 250 Hz and a 300 Hz sound, then try 16,000 Hz and 16,050 Hz.

With plain spectrograms, the model would learn differently from Mandarin speakers, which could add bias; this is where the Mel scale comes in handy. The Mel scale warps the frequency axis to match how we perceive pitch, so the sounds we consider important get more resolution.

A Mel Spectrogram is a spectrogram with the Mel scale on the y-axis, which is the data we need to replicate a human’s ability to recognise tones.
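The listening test above can also be expressed in numbers. The sketch below uses one widely used Hz-to-mel formula (the HTK-style variant; other variants exist) to compare the same 50 Hz gap at low and high frequencies. The function name `hz_to_mel` is just a label chosen for this example.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# The same 50 Hz step, measured on the Mel scale.
low_gap = hz_to_mel(300) - hz_to_mel(250)
high_gap = hz_to_mel(16050) - hz_to_mel(16000)

print(f"250 Hz -> 300 Hz:     {low_gap:.1f} mel")
print(f"16000 Hz -> 16050 Hz: {high_gap:.1f} mel")
```

The low-frequency step covers many more mels than the high-frequency one, which matches what you hear: the first pair is easy to tell apart and the second is not.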

1. Build Data

The audio comes from a website you can use to request word pronunciations in many languages. The variety of words and speakers improves the generalisation of the model. Speech synthesis (artificial voice) is less prone to error, but the model would be less applicable to the real world.

2. Data Augmentation

The data cleaning method I use within this project removes nearly half of the dataset; this leaves me with insufficient data to prevent the model from overfitting. Augmenting the data solves this problem.

You can generate artificial samples by manipulating frequencies or injecting noise into the audio files. This yields about three times as much data as the original dataset, enough to reduce the variance of the model.
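As a rough illustration of these two kinds of augmentation, here is a NumPy-only sketch. It is not the code used in this project: the function names, the 20 dB signal-to-noise ratio, and the 5% pitch factor are assumptions, and the pitch change is a crude resampling that also shortens the clip.

```python
import numpy as np


def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Inject white noise at the requested signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def change_pitch(wave: np.ndarray, factor: float) -> np.ndarray:
    """Crude pitch change by resampling: factor > 1 raises the pitch
    and shortens the clip by the same factor."""
    new_length = int(round(len(wave) / factor))
    old_positions = np.arange(len(wave))
    new_positions = np.linspace(0, len(wave) - 1, new_length)
    return np.interp(new_positions, old_positions, wave)


rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
original = 0.5 * np.sin(2 * np.pi * 220 * t)   # 1 second stand-in for a recording

# One original plus two artificial variants: three samples from one file.
augmented = [
    original,
    add_noise(original, snr_db=20, rng=rng),
    change_pitch(original, factor=1.05),
]
```

A small pitch factor shifts the whole clip up or down while keeping the shape of the pitch contour, which is what carries the tone. Larger factors would be riskier for this task.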

3. Data Cleaning

Audio pronunciations recorded by random people are bound to include errors. Mislabelled or low-quality audio files can decrease model performance. These are the steps I took to clean the data.

Remove Silence

The majority of the audio files include excess silence. The pydub library can remove it, which also narrows the range of durations. Mel spectrograms have time on the x-axis, so a narrower range means less padding is needed to give them all the same dimensions. Because my computing resources are limited (8GB GPU), removing silence also reduces the memory footprint of the data and the training time.

Removing silence also helps identify files with excess noise. Because the noise is never detected as silence, these files keep their original length. To remove the noisy data, I deleted files longer than 2 seconds.
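The sketch below shows why trimming silence exposes noisy files. It does not use pydub; it is a plain NumPy stand-in that trims leading and trailing frames by their RMS level. The -40 dB threshold, the 10 ms frame size, and the function names are assumptions.

```python
import numpy as np


def trim_silence(wave, sample_rate, threshold_db=-40.0, frame_ms=10):
    """Drop leading and trailing frames whose RMS level is below
    threshold_db (relative to a full-scale amplitude of 1.0)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return wave
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(np.maximum(rms, 1e-10))
    loud = np.flatnonzero(level_db > threshold_db)
    if loud.size == 0:
        return wave[:0]                      # the whole clip is silence
    start = loud[0] * frame_len
    end = (loud[-1] + 1) * frame_len
    return wave[start:end]


def is_too_long(wave, sample_rate, max_seconds=2.0):
    """Flag clips that are still longer than max_seconds after trimming."""
    return len(wave) / sample_rate > max_seconds


sample_rate = 16000
rng = np.random.default_rng(0)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(8000) / sample_rate)
silence = np.zeros(sample_rate)

# 1 s silence, 0.5 s speech stand-in, 1 s silence.
clean = np.concatenate([silence, tone, silence])
# The same clip with background noise throughout.
noisy = clean + rng.normal(0.0, 0.05, size=clean.shape)

trimmed_clean = trim_silence(clean, sample_rate)
trimmed_noisy = trim_silence(noisy, sample_rate)

print(len(trimmed_clean) / sample_rate, is_too_long(trimmed_clean, sample_rate))
print(len(trimmed_noisy) / sample_rate, is_too_long(trimmed_noisy, sample_rate))
```

The clean clip shrinks to the half second that contains sound, while the noisy clip keeps its full 2.5 seconds because no frame ever drops below the threshold, so the 2-second rule catches it.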

Principal Component Analysis

Mel spectrograms provide higher-dimensional information but are tough to interpret. Principal Component Analysis (PCA) reduces the dimensions of the data, which makes it possible to visualise the data and identify outliers. To stay safe, I removed the outliers from the training data.
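Here is a minimal sketch of that idea using NumPy's SVD: project flattened spectrograms onto two principal components, then flag points that sit unusually far from the centre. The outlier rule (a z-score of 3 on the distance from the mean) and the synthetic data are assumptions; the article does not depend on a specific rule.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of `features` onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def flag_outliers(points: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    """Mark points whose distance from the centre is an extreme z-score."""
    distance = np.linalg.norm(points - points.mean(axis=0), axis=1)
    z_scores = (distance - distance.mean()) / distance.std()
    return z_scores > z_threshold


rng = np.random.default_rng(0)
# Stand-in for 200 flattened spectrograms with 50 values each...
normal_files = rng.normal(0.0, 1.0, size=(200, 50))
# ...plus 3 clearly unusual files.
odd_files = rng.normal(15.0, 1.0, size=(3, 50))
features = np.vstack([normal_files, odd_files])

points_2d = pca_project(features)        # shape (203, 2), ready for a scatter plot
outlier_mask = flag_outliers(points_2d)
cleaned = features[~outlier_mask]
```

Plotting `points_2d` as a scatter plot is the visual check; the mask is a way to automate the removal once the threshold looks sensible.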

Speech Recognition Analysis

Manually checking 8000+ audio files is not a scalable process. Instead, I used Google Speech Recognition to do it for me.

If Google Speech Recognition can predict the word or tone by listening to the audio file, I assume the data is suitable for training. This assumption is imperfect, but it removes the majority of mislabelled and incomprehensible files.

I also used speech recognition to remove words longer than two characters and recordings in the wrong language, to keep the data consistent.
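The filtering rules can be sketched independently of any speech service. In the example below, `transcribe` is a hypothetical callable that returns the recognised text for a file (or `None` when nothing is recognised); in practice it would wrap a speech recognition API. The file names, the fake transcripts, and the simple character-range check for Chinese are all assumptions, and only the word (not the tone) is compared.

```python
from typing import Callable, List, Optional, Tuple


def is_chinese(text: str) -> bool:
    """Rough check: every character is in the main CJK ideograph range."""
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)


def keep_sample(expected_word: str, transcript: Optional[str], max_chars: int = 2) -> bool:
    """Keep a file only if it is short, in Chinese, and recognised correctly."""
    if transcript is None:                 # nothing recognised
        return False
    transcript = transcript.strip()
    if len(expected_word) > max_chars:     # word too long
        return False
    if not is_chinese(transcript):         # wrong language
        return False
    return transcript == expected_word     # likely mislabelled otherwise


def filter_dataset(
    samples: List[Tuple[str, str]],
    transcribe: Callable[[str], Optional[str]],
) -> List[str]:
    """Return the paths whose transcript passes every rule."""
    return [path for path, word in samples if keep_sample(word, transcribe(path))]


# Fake recogniser output, standing in for a real speech recognition call.
fake_transcripts = {"a.mp3": "你好", "b.mp3": "hello", "c.mp3": "图书馆", "d.mp3": None}
samples = [("a.mp3", "你好"), ("b.mp3", "谢谢"), ("c.mp3", "图书馆"), ("d.mp3", "再见")]

kept = filter_dataset(samples, fake_transcripts.get)
print(kept)
```

Passing the recogniser in as a function keeps the rules testable offline and makes it easy to swap services.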

Sample Training Data

The distribution of the tones in the dataset is close enough to the real world. I decided to keep this distribution as resampling could remove too much data.

4. Build Training Data

All that is left to do is to convert the data into a format the Machine Learning model can understand.

I padded the audio files with silence to ensure they have the same duration — each spectrogram will then have the same dimensions.
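A minimal sketch of the padding step, assuming each clip is a 1-D NumPy array and that silence (zeros) is appended at the end. Whether to pad at the end, the start, or both sides is a choice this sketch makes for simplicity.

```python
import numpy as np


def pad_to_length(wave: np.ndarray, target_length: int) -> np.ndarray:
    """Append zeros (silence) so the clip has exactly target_length samples."""
    if len(wave) > target_length:
        raise ValueError("clip is longer than the target length")
    return np.pad(wave, (0, target_length - len(wave)), mode="constant")


# Three tiny stand-in clips of different lengths.
clips = [np.ones(3), np.ones(5), np.ones(2)]
longest = max(len(clip) for clip in clips)

padded = np.stack([pad_to_length(clip, longest) for clip in clips])
print(padded.shape)
```

Once every clip has the same number of samples, every spectrogram computed with the same settings has the same number of time frames.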

I plotted the Mel Spectrograms using the librosa library and scaled them into images. I saved the pixels within the spectrograms into an array.
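The scaling step can be sketched as a min-max rescale to 8-bit pixel values. The input here is a synthetic stand-in for a decibel-scaled Mel spectrogram; with librosa, such an array would typically come from `librosa.feature.melspectrogram` followed by `librosa.power_to_db`. The function name and the 128x64 shape are assumptions.

```python
import numpy as np


def to_pixels(spec_db: np.ndarray) -> np.ndarray:
    """Rescale a spectrogram to the 0-255 range and store it as uint8."""
    lowest, highest = spec_db.min(), spec_db.max()
    if highest == lowest:                      # flat input: avoid dividing by zero
        return np.zeros(spec_db.shape, dtype=np.uint8)
    scaled = (spec_db - lowest) / (highest - lowest)
    return np.round(scaled * 255).astype(np.uint8)


# Stand-in for a (n_mels, n_frames) spectrogram in decibels.
spec_db = np.linspace(-80.0, 0.0, 128 * 64).reshape(128, 64)
pixels = to_pixels(spec_db)
print(pixels.shape, pixels.dtype, pixels.min(), pixels.max())
```

Note that this rescales each spectrogram by its own minimum and maximum. Using fixed limits for the whole dataset is an alternative that preserves loudness differences between files.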

I subtracted one from each tone label so the classes range from 0–5 instead of 1–6; PyTorch's classification losses expect class indices that start at zero. I then added the two tones of each word to the training data as a tuple.

Finally, I used the PyTorch data loader class to serve the data as tensors, the appropriate format for training.
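Putting the last two steps together, here is a hedged sketch of a PyTorch `Dataset` that shifts the labels and returns each sample with its pair of tones. The class name `ToneDataset`, the random stand-in images, and the batch size are assumptions, not the code from this project.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToneDataset(Dataset):
    """Hypothetical dataset: spectrogram image plus a (tone, tone) label pair."""

    def __init__(self, images: np.ndarray, tone_pairs):
        # images: (N, H, W) uint8 pixels -> (N, 1, H, W) floats in [0, 1].
        self.images = torch.from_numpy(images).float().unsqueeze(1) / 255.0
        # Labels arrive as 1-6; subtract one so they start at zero.
        self.labels = torch.tensor(tone_pairs, dtype=torch.long) - 1

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        first, second = self.labels[index]
        return self.images[index], (first, second)


rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 64, 32), dtype=np.uint8)   # fake spectrograms
tone_pairs = [(1, 4), (3, 3), (2, 6), (4, 1), (5, 2), (6, 6), (1, 1), (2, 3)]

dataset = ToneDataset(images, tone_pairs)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_images, (first_tone, second_tone) = next(iter(loader))
print(batch_images.shape, first_tone, second_tone)
```

The default collate function keeps the tuple structure, so each batch yields one tensor of images and one tensor of labels per tone, which is the shape a two-headed model needs.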


I trained three models of different complexity on the dataset: SimpleCNN, ComplexCNN and DeepCNN. All three achieved an accuracy above 80% on both y labels, which suggests that this is a high-quality dataset. The DeepCNN performed best, with an accuracy above 90% on both y labels. But 90% accuracy still falls short of human-level performance. The main problem is the request quota, which limits how much data I can collect and so prevents me from reducing the variance of the model further.


I was able to teach the model to classify which tone a person is using, but I did not reach human-level performance. AI models seem to fall behind when dealing with unstructured data, and pushing past human-level performance is always a challenge.

Many of the methods used in this project are transferable to other areas of communication, such as understanding emotions through tone of voice. Detecting the tones in Mandarin is a relatively simple task, so it may provide a good first step towards building a model that can solve more complex problems.


