Introduction

OpenAI has open-sourced Whisper, a speech recognition project whose English recognition accuracy is claimed to be on par with humans and which also supports automatic speech recognition in 98 other languages. Whisper provides both automatic speech recognition and translation: it can transcribe speech in many languages into text and translate that text into English. The main goal of this project is to fine-tune the Whisper model using LoRA. Several open-source models are currently available from OpenAI; the commonly used ones are listed below. The project also uses CTranslate2 to accelerate speech recognition inference; note that accelerated inference can convert the original Whisper model directly, so fine-tuning is not strictly required.

  • openai/whisper-tiny
  • openai/whisper-base
  • openai/whisper-small
  • openai/whisper-medium
  • openai/whisper-large
  • openai/whisper-large-v2

Source Code Address: Whisper-Finetune

Environment Used:

  • Anaconda 3
  • Python 3.8
  • PyTorch 1.13.1
  • Ubuntu 18.04
  • GPU A100-PCIE-40GB*1

Introduction to Main Program Files

  1. aishell.py: Prepares AIShell training data.
  2. finetune.py: Fine-tunes the model.
  3. merge_lora.py: Merges the Whisper and LoRA models.
  4. evaluation.py: Evaluates the fine-tuned model or the original Whisper model.
  5. infer_tfs.py: Uses transformers to directly call the fine-tuned or original Whisper model for prediction, suitable only for short audio inference.
  6. infer_ct2.py: Uses the model converted to CTranslate2 format for prediction; refer mainly to this program for how to use the models.
  7. infer_gui.py: Provides a GUI interface for prediction using the CTranslate2-converted model.
  8. infer_server.py: Deploys the CTranslate2-converted model to the server for client-side calls.

Welcome to join the Knowledge Planet for discussion. The Knowledge Planet provides this project's model files, model files from the blogger's other related projects, and other resources.

![](/static/files/2023-04-23/5db7f88df40b42c998f162490919436b.png)

Model Test Table

  1. Word Error Rate (WER) Test Table for the Original Models

| Model Used | Specified Language | aishell_test | test_net | test_meeting | Model Acquisition |
| --- | --- | --- | --- | --- | --- |
| whisper-tiny | Chinese | 0.31898 | 0.40482 | 0.75332 | Available by joining the Knowledge Planet |
| whisper-base | Chinese | 0.22196 | 0.30404 | 0.50378 | Available by joining the Knowledge Planet |
| whisper-small | Chinese | 0.13897 | 0.18417 | 0.31154 | Available by joining the Knowledge Planet |
| whisper-medium | Chinese | 0.09538 | 0.13591 | 0.26669 | Available by joining the Knowledge Planet |
| whisper-large | Chinese | 0.08969 | 0.12933 | 0.23439 | Available by joining the Knowledge Planet |
| whisper-large-v2 | Chinese | 0.08817 | 0.12332 | 0.26547 | Available by joining the Knowledge Planet |
  2. Word Error Rate (WER) Test Table after Fine-Tuning on Each Dataset

| Model Used | Specified Language | Dataset | aishell_test | test_net | test_meeting | Model Acquisition |
| --- | --- | --- | --- | --- | --- | --- |
| whisper-tiny | Chinese | AIShell | 0.13043 | 0.4463 | 0.57728 | Available by joining the Knowledge Planet |
| whisper-base | Chinese | AIShell | 0.08999 | 0.33089 | 0.40713 | Available by joining the Knowledge Planet |
| whisper-small | Chinese | AIShell | 0.05452 | 0.19831 | 0.24229 | Available by joining the Knowledge Planet |
| whisper-medium | Chinese | AIShell | 0.03681 | 0.13073 | 0.16939 | Available by joining the Knowledge Planet |
| whisper-large-v2 | Chinese | AIShell | 0.03139 | 0.12201 | 0.15776 | Available by joining the Knowledge Planet |
| whisper-tiny | Chinese | WenetSpeech | 0.17711 | 0.24783 | 0.39226 | Available by joining the Knowledge Planet |
| whisper-large-v2 | Chinese | WenetSpeech | 0.05443 | 0.10087 | 0.19087 | Available by joining the Knowledge Planet |
  3. Inference Speed Test Table (Non-accelerated vs. CTranslate2-accelerated), GPU: GTX 3090 (24G)

| Model Used | Original Model RTF (float16) | CTranslate2-Accelerated RTF (float16) | CTranslate2-Accelerated RTF (int8_float16) |
| --- | --- | --- | --- |
| whisper-tiny | 0.03 | 0.06 | 0.06 |
| whisper-base | 0.04 | 0.06 | 0.06 |
| whisper-small | 0.08 | 0.08 | 0.08 |
| whisper-medium | 0.13 | 0.10 | 0.10 |
| whisper-large-v2 | 0.19 | 0.12 | 0.12 |
  4. Processed Data Lists

| Data Processing Method | AIShell | WenetSpeech |
| --- | --- | --- |
| Add punctuation marks | Available by joining the Knowledge Planet | Available by joining the Knowledge Planet |
| Add punctuation marks and timestamps | Available by joining the Knowledge Planet | Available by joining the Knowledge Planet |

Important Notes:

  1. During evaluation, the punctuation marks output by the model are removed, and traditional Chinese is converted to simplified Chinese.
  2. aishell_test is the test set of AIShell, while test_net and test_meeting are test sets of WenetSpeech.
  3. RTF = total ASR processing time (seconds) / total audio duration (seconds); a lower value means faster than real time (see the short example after this list).
  4. The audio used for the speed test is dataset/test.wav, with a duration of 8 seconds.
  5. The training data contains punctuation marks, which may result in a slightly higher word error rate.
  6. The AIShell fine-tuning data does not include timestamps, while the WenetSpeech fine-tuning data does.
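As a quick illustration of note 3, using hypothetical (not measured) numbers:

# Hypothetical example of the RTF definition in note 3 above.
audio_seconds = 8.0          # duration of the test clip
processing_seconds = 0.96    # made-up transcription time
rtf = processing_seconds / audio_seconds
print(rtf)  # 0.12 -> roughly 8x faster than real time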

Installation Environment

  1. First, install the GPU version of PyTorch. If it is already installed, skip this step.
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
  2. Install the remaining required dependency libraries.
python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Data Preparation

The training dataset is a JSONL file in which each line is a JSON entry in the format shown below. Whisper supports punctuation marks, so the training data may include them. This project provides the program aishell.py to generate the training and test sets. Note: the program can skip the download step if you specify the path to an already-downloaded AIShell archive; if the direct download is slow, use a download manager such as Thunder to fetch the dataset and then pass the path via the --filepath parameter, e.g., /home/test/data_aishell.tgz. If you train without timestamps, the sentences field can be omitted.

{
   "audio": {
      "path": "dataset/0.wav"
   },
   "sentence": "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。",
   "sentences": [
      {
         "start": 0,
         "end": 1.4,
         "text": "近几年,"
      },
      {
         "start": 1.42,
         "end": 8.4,
         "text": "不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。"
      }
   ],
   "duration": 7.37
}
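If you prepare your own data instead of using aishell.py, a manifest in this format can be written with a few lines of Python. This is a minimal sketch; the entry list and the output file name dataset/train.jsonl are only illustrative.

import json

# Hypothetical entries: (audio path, transcript, duration in seconds).
entries = [
    ("dataset/0.wav", "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。", 7.37),
]

with open("dataset/train.jsonl", "w", encoding="utf-8") as f:
    for path, text, duration in entries:
        record = {
            "audio": {"path": path},
            "sentence": text,
            "duration": duration,
            # Add a "sentences" list of {"start", "end", "text"} items here
            # if you want to train with timestamps; otherwise omit it.
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")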

Model Fine-Tuning

After preparing the data, you can start fine-tuning the model. The two most critical parameters are:
- --base_model: Specifies the base Whisper model to fine-tune. The value must be a model that exists on HuggingFace (it is downloaded automatically at training time if missing) or a local model path, in which case also pass --local_files_only=True.
- --output_dir: Specifies the directory in which the LoRA checkpoints are saved during training.
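Conceptually, the LoRA part of finetune.py attaches small trainable adapter matrices to the frozen Whisper weights. The snippet below is only a minimal sketch assuming the peft and transformers libraries; the rank, dropout, and target modules used by the actual script may differ.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base Whisper model (from HuggingFace or a local path).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Attach low-rank adapters to the attention projections; only these adapters
# are trained, while the original weights stay frozen.
lora_config = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how small the trainable fraction is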

Single GPU Training

CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model=openai/whisper-tiny --output_dir=output/

Multi-GPU Training

There are two methods for multi-GPU training: torchrun and accelerate.

  1. Using torchrun to start multi-GPU training:
torchrun --nproc_per_node=2 finetune.py --base_model=openai/whisper-tiny --output_dir=output/
  2. Using accelerate to start multi-GPU training:

First, configure training parameters (answer the prompts, mostly default):

accelerate config

Then start training:

accelerate launch finetune.py --base_model=openai/whisper-tiny --output_dir=output/

Sample Training Log:

{'loss': 0.9098, 'learning_rate': 0.000999046843662503, 'epoch': 0.01}
{'loss': 0.5898, 'learning_rate': 0.0009970611012927184, 'epoch': 0.01}
{'loss': 0.5583, 'learning_rate': 0.0009950753589229333, 'epoch': 0.02}
{'loss': 0.5469, 'learning_rate': 0.0009930896165531485, 'epoch': 0.02}
{'loss': 0.5959, 'learning_rate': 0.0009911038741833634, 'epoch': 0.03}

Model Merging

After fine-tuning, two models exist: the base Whisper model and the LoRA model. They need to be merged for subsequent operations. The program requires two parameters:
- --lora_model: Path to the LoRA model saved after training (note the correct directory structure, e.g., output/checkpoint-final).
- --output_dir: Directory to save the merged model.

python merge_lora.py --lora_model=output/checkpoint-final --output_dir=models/
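For reference, the core of such a merge looks roughly like the following; this is a minimal sketch assuming the peft library, and merge_lora.py may additionally handle the tokenizer and feature-extractor files.

from transformers import WhisperForConditionalGeneration
from peft import PeftModel

# Load the frozen base model, then the trained LoRA adapter on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model = PeftModel.from_pretrained(base, "output/checkpoint-final")

# Fold the low-rank updates into the base weights and drop the adapter
# wrappers, leaving a plain Whisper model that can be saved and converted.
model = model.merge_and_unload()
model.save_pretrained("models/whisper-tiny-finetune")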

Model Evaluation

Run the following program to evaluate the model. Key parameters:
- --model_path: Path to the merged model (or directly use the original Whisper model, e.g., openai/whisper-large-v2).
- --metric: Evaluation method, e.g., cer (character error rate) or wer (word error rate).

python evaluation.py --model_path=models/whisper-tiny-finetune --metric=cer
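For orientation, the character error rate itself can be computed with the Hugging Face evaluate library; the snippet below is a minimal sketch with made-up strings, while evaluation.py runs over the whole test set and applies the normalization described in the notes above.

import evaluate

# Character error rate between hypotheses and references (0.0 is a perfect match).
cer_metric = evaluate.load("cer")
predictions = ["近几年不但我用书给女儿压岁"]
references = ["近几年不但我用书给女儿压岁钱"]
print(cer_metric.compute(predictions=predictions, references=references))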

Inference

Run the following program for speech recognition. This program uses transformers to call the fine-tuned or original Whisper model, suitable only for short audio inference. For long audio, refer to infer_ct2.py.

python infer_tfs.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-finetune
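As a point of comparison, the same kind of short-audio transcription can be done directly with a transformers pipeline; this is a minimal sketch and not necessarily how infer_tfs.py is implemented.

from transformers import pipeline

# Load the merged (or original) Whisper model as an ASR pipeline on GPU 0.
asr = pipeline(task="automatic-speech-recognition",
               model="models/whisper-tiny-finetune",
               device=0)

# Best suited to short clips; Whisper processes up to ~30 seconds at a time.
result = asr("dataset/test.wav")
print(result["text"])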

Accelerated Inference

To accelerate inference (original Whisper is slow), use CTranslate2. First, convert the merged model to CTranslate2 format:

ct2-transformers-converter --model models/whisper-tiny-finetune --output_dir models/whisper-tiny-ct2 --copy_files tokenizer.json --quantization float16

Then run the inference with the converted model:

python infer_ct2.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-ct2

Sample Output:

{
    "language": "zh",
    "duration": 8.39,
    "results": [
        {
            "start": 0.0,
            "end": 8.39,
            "text": "近几年不但我用书给女儿压岁也劝说亲朋友不要给女儿压岁钱而改送压岁书"
        }
    ],
    "text": "近几年不但我用书给女儿压岁也劝说亲朋友不要给女儿压岁钱而改送压岁书"
}
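Programmatically, a CTranslate2-converted Whisper model can also be driven through the faster-whisper library; the snippet below is a minimal sketch under that assumption, and infer_ct2.py may expose things differently.

from faster_whisper import WhisperModel

# Load the converted model; compute_type should match the quantization used
# during conversion (float16 here).
model = WhisperModel("models/whisper-tiny-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("dataset/test.wav", language="zh", task="transcribe")
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f} - {segment.end:.2f}] {segment.text}")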

GUI Inference

Run the following program to use the GUI interface for inference (CTranslate2 is used for acceleration):

python infer_gui.py --model_path=models/whisper-tiny-ct2

Sample GUI Interface:

Web Deployment

Deploy the model to the server using CTranslate2 for multi-client access:

python infer_server.py --host=0.0.0.0 --port=5000 --model_path=models/whisper-tiny-ct2 --num_workers=2

API Documentation

Two interfaces are provided:
- /recognition: Standard recognition (non-streaming).
- /recognition_stream: Streaming result return (suitable for long audio).

Request Parameters:

| Field | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| audio | Yes | File |  | Audio file to be recognized |
| to_simple | No | int | 1 | Convert traditional Chinese to simplified Chinese (0: no, 1: yes) |
| remove_pun | No | int | 0 | Remove punctuation marks (0: no, 1: yes) |
| task | No | String | transcribe | Task type: transcribe or translate |

Sample Python Call for /recognition:

import requests

# The audio file is sent as multipart form data; the other parameters are sent
# as form fields (passing them via json= alongside files= would cause requests
# to drop them from the request body).
response = requests.post(url="http://127.0.0.1:5000/recognition",
                         files=[("audio", ("test.wav", open("dataset/test.wav", 'rb'), 'audio/wav'))],
                         data={"to_simple": 1, "remove_pun": 0, "task": "transcribe"}, timeout=20)
print(response.text)

Sample Python Call for /recognition_stream:

import json
import requests

# Stream results for long audio; parameters are sent as form fields alongside
# the uploaded file (json= would be ignored when files= is used).
response = requests.post(url="http://127.0.0.1:5000/recognition_stream",
                         files=[("audio", ("test.wav", open("dataset/test_long.wav", 'rb'), 'audio/wav'))],
                         data={"to_simple": 1, "remove_pun": 0, "task": "transcribe"}, stream=True, timeout=20)
# The streamed JSON results are separated by NUL bytes.
for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
    if chunk:
        result = json.loads(chunk.decode())
        text = result["result"]
        start = result["start"]
        end = result["end"]
        print(f"[{start} - {end}]: {text}")

Android Deployment

The Android deployment code is in the AndroidDemo directory.

Model Conversion

  1. Clone the Whisper source code: