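Introduction

This article shows how to quickly train and run inference with the MASR speech recognition framework. It sticks to the simplest possible usage; for more advanced features, consult the documentation in the source repository. Training and inference each take only about three lines of code.

Source code: https://github.com/yeyupiaoling/MASR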

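Setting up the environment

This guide uses Anaconda with a Python 3.11 virtual environment.

  • First install the GPU build of PyTorch 2.5.1. Skip this step if it is already installed.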
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1  pytorch-cuda=11.8 -c pytorch -c nvidia
  • Then install the MASR library with pip:
python -m pip install masr -U -i https://pypi.tuna.tsinghua.edu.cn/simple
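
Before moving on, it can help to confirm that both packages are visible to the interpreter. The sketch below uses only the standard library; the helper name installed_versions is hypothetical and is not part of MASR or PyTorch.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # "torch" is the Python distribution name of PyTorch; "masr" is the pip package installed above
    for name, version in installed_versions(["torch", "masr"]).items():
        print(f"{name}: {version or 'not installed'}")
```

With PyTorch installed, torch.cuda.is_available() additionally reports whether the GPU build can see a CUDA device.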

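Preparing the dataset

Run the code below to download the data and build the data lists automatically. The default download can be slow; in that case, copy the download URL, fetch the archive with a download manager such as Xunlei, and set filepath to the path of the downloaded file so that the data lists can be built right away.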

import argparse
import os
import functools
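# utility is a helper module from the MASR source repository (not the standard
# library); it must be importable from the directory this script runs in
from utility import download, unpack
from utility import add_arguments, print_arguments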

DATA_URL = 'https://openslr.trmal.net/resources/33/data_aishell.tgz'
MD5_DATA = '2f494334227864a8a8fec932999db9d8'

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
add_arg("target_dir", default="dataset/audio/", type=str, help="Directory for storing the audio files")
add_arg("annotation_text", default="dataset/annotation/", type=str, help="Directory for storing the audio annotation files")
add_arg("filepath", default=None, type=str, help="Path to a dataset archive downloaded in advance")
args = parser.parse_args()


def create_annotation_text(data_dir, annotation_path):
    print('Create Aishell annotation text ...')
    os.makedirs(annotation_path, exist_ok=True)
    # Map each audio ID to its transcript, with all whitespace removed
    transcript_path = os.path.join(data_dir, 'transcript', 'aishell_transcript_v0.8.txt')
    transcript_dict = {}
    with open(transcript_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            audio_id, text = line.split(' ', 1)
            transcript_dict[audio_id] = ''.join(text.split())

    def write_split(split, f_out):
        # Write one "audio_path<TAB>text" line per audio file that has a transcript
        audio_dir = os.path.join(data_dir, 'wav', split)
        for subfolder, _, filelist in sorted(os.walk(audio_dir)):
            for fname in filelist:
                audio_path = os.path.join(subfolder, fname).replace('\\', '/')
                audio_id = fname[:-4]
                # Skip audio files that have no transcript
                if audio_id not in transcript_dict:
                    continue
                f_out.write(audio_path.replace('../', '') + '\t' + transcript_dict[audio_id] + '\n')

    # aishell.txt is rewritten on every run; test.txt is opened in append mode
    # (created if missing), so re-running the script adds the test entries again
    with open(os.path.join(annotation_path, 'aishell.txt'), 'w', encoding='utf-8') as f_train, \
            open(os.path.join(annotation_path, 'test.txt'), 'a', encoding='utf-8') as f_test:
        for split in ['train', 'dev']:
            write_split(split, f_train)
        write_split('test', f_test)


def prepare_dataset(url, md5sum, target_dir, annotation_path):
    """Download, unpack and create manifest file."""
    data_dir = os.path.join(target_dir, 'data_aishell')
    if not os.path.exists(data_dir):
        if args.filepath is None:
            filepath = download(url, md5sum, target_dir)
        else:
            filepath = args.filepath
        unpack(filepath, target_dir)
        # unpack all audio tar files
        audio_dir = os.path.join(data_dir, 'wav')
        for subfolder, _, filelist in sorted(os.walk(audio_dir)):
            for ftar in filelist:
                unpack(os.path.join(subfolder, ftar), subfolder, True)
        # Delete the archive once unpacked; note this also removes a file supplied via filepath
        os.remove(filepath)
    else:
        print("Skip downloading and unpacking. Aishell data already exists in %s." % target_dir)
    create_annotation_text(data_dir, annotation_path)


def main():
    print_arguments(args)
    if args.target_dir.startswith('~'):
        args.target_dir = os.path.expanduser(args.target_dir)

    prepare_dataset(url=DATA_URL,
                    md5sum=MD5_DATA,
                    target_dir=args.target_dir,
                    annotation_path=args.annotation_text)


if __name__ == '__main__':
    main()
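
Once the script finishes, the generated lists can be sanity-checked. Each line written above has the form audio_path<TAB>text, so a minimal check, assuming the default annotation directory dataset/annotation/, might look like this (summarize_data_list is a hypothetical helper, not part of MASR):

```python
import os


def summarize_data_list(list_path):
    """Count entries in a data list and collect malformed or missing ones.

    Each valid line is expected to be "audio_path<TAB>text".
    """
    total, malformed, missing = 0, [], []
    with open(list_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("\t", 1)
            if len(parts) != 2 or not parts[1]:
                # No tab separator or empty transcript: record the line number
                malformed.append(line_no)
                continue
            total += 1
            if not os.path.isfile(parts[0]):
                missing.append(parts[0])
    return {"total": total, "malformed": malformed, "missing": missing}


if __name__ == "__main__":
    # Paths assume the script's default annotation_text value
    for name in ("aishell.txt", "test.txt"):
        path = os.path.join("dataset/annotation", name)
        if os.path.exists(path):
            print(name, summarize_data_list(path))
```

The audio paths in the lists are relative, so the check should be run from the same working directory as the preparation script.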

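Training

Training with the MASR framework is simple: the core code is just three lines, shown below. The configs parameter selects which default configuration file to use.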

from masr.trainer import MASRTrainer

trainer = MASRTrainer(configs="conformer", use_gpu=True)

trainer.train(save_model_path="models/")

The output looks similar to the following (log messages translated from the Chinese original):

2025-03-08 11:04:57.884 | INFO     | masr.optimizer:build_optimizer:16 - Successfully created optimizer Adam with parameters: {'lr': 0.001, 'weight_decay': 1e-06}
2025-03-08 11:04:57.884 | INFO     | masr.optimizer:build_lr_scheduler:31 - Successfully created learning rate scheduler WarmupLR with parameters: {'warmup_steps': 25000, 'min_lr': 1e-05}
2025-03-08 11:04:57.885 | INFO     | masr.trainer:train:541 - Vocabulary size: 5561
2025-03-08 11:04:57.885 | INFO     | masr.trainer:train:542 - Training samples: 13382
2025-03-08 11:04:57.885 | INFO     | masr.trainer:train:543 - Evaluation samples: 27
2025-03-08 11:04:58.642 | INFO     | masr.trainer:__train_epoch:414 - Train epoch: [1/200], batch: [0/836], loss: 51.60880, learning_rate: 0.00000008, reader_cost: 0.1062, batch_cost: 0.6486, ips: 21.1991 speech/sec, eta: 1 day, 11:03:13
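
The very small learning_rate in the first training log line reflects the warmup phase. As a rough illustration, the sketch below evaluates a Noam-style warmup schedule of the kind used in toolkits such as WeNet and ESPnet. That MASR's WarmupLR follows this exact formula is an assumption, and the min_lr setting is left out because the log does not show how it is applied. The function name warmup_lr is hypothetical.

```python
def warmup_lr(step, base_lr=0.001, warmup_steps=25000):
    """Noam-style warmup: linear ramp up to base_lr, then inverse-sqrt decay.

    Assumed formula (not confirmed against the MASR source):
        lr = base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # base_lr and warmup_steps are the values printed in the training log
    for step in (1, 2, 25000, 100000):
        print(f"step {step:>6}: lr = {warmup_lr(step):.8f}")
```

Under this formula the rate grows by base_lr / warmup_steps = 4e-8 per step during warmup, so step 2 gives 8e-8, the same value as the learning_rate in the log. It peaks at 0.001 at step 25000 and decays afterwards.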

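Exporting the model

After training, the model must be exported before it can be used for inference. Exporting is just as simple and takes three lines of code: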

from masr.trainer import MASRTrainer

# Create the trainer
trainer = MASRTrainer(configs="conformer", use_gpu=True)

# Export the inference model
trainer.export(save_model_path='models/',
               resume_model='models/ConformerModel_fbank/best_model/')
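
Before moving on to inference, it may be worth confirming that the export produced a model file. The sketch below assumes the exported file is named inference.pth inside the inference_model directory, as the inference log later in this article suggests; exported_model_path is a hypothetical helper, not part of MASR.

```python
from pathlib import Path


def exported_model_path(model_dir, filename="inference.pth"):
    """Return the exported model file as a Path, or None if it does not exist.

    The default filename is an assumption taken from the inference log output.
    """
    path = Path(model_dir) / filename
    return path if path.is_file() else None


if __name__ == "__main__":
    model = exported_model_path("models/ConformerModel_fbank/inference_model/")
    if model:
        print(f"Exported model: {model}")
    else:
        print("No exported model found; run the export step first.")
```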

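Inference

Inference is equally simple: load the exported model, then call predict on an audio file. The few lines below are all it takes to run speech recognition.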

from masr.predict import MASRPredictor

predictor = MASRPredictor(model_dir="models/ConformerModel_fbank/inference_model/", use_gpu=True)

audio_path = "dataset/test.wav"
result = predictor.predict(audio_data=audio_path)
print(f"Recognition result: {result}")

The output is as follows (log messages translated from the Chinese original; the recognized text is left as returned):

2025-03-08 11:21:52.100 | INFO     | masr.infer_utils.inference_predictor:__init__:38 - Loaded model models/ConformerModel_fbank/inference_model/inference.pth
2025-03-08 11:21:52.147 | INFO     | masr.predict:__init__:117 - Streaming VAD model loaded
2025-03-08 11:21:52.147 | INFO     | masr.predict:__init__:119 - Warming up the predictor...
2025-03-08 11:22:01.366 | INFO     | masr.predict:reset_predictor:471 - Predictor reset
2025-03-08 11:22:01.366 | INFO     | masr.predict:__init__:128 - Predictor is ready
Recognition result: {'text': '近幾年不但我用書給女兒壓歲也勸說親朋不要給女兒壓歲錢而改送壓歲書', 'sentences': [{'text': '近幾年不但我用書給女兒壓歲也勸說親朋不要給女兒壓歲錢而改送壓歲書', 'start': 0, 'end': 8.39}]}
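
The returned dictionary carries the full text plus per-sentence timestamps. A small sketch of turning it into readable lines follows, assuming start and end are in seconds (the article does not state the unit) and using the hypothetical helper format_sentences:

```python
def format_sentences(result):
    """Format each entry of result['sentences'] as '[start - end] text'.

    Assumes the dictionary structure shown in the output above and that
    start/end are expressed in seconds.
    """
    lines = []
    for sentence in result.get("sentences", []):
        lines.append(f"[{sentence['start']:.2f}s - {sentence['end']:.2f}s] {sentence['text']}")
    return lines


if __name__ == "__main__":
    # A shortened stand-in with the same shape as the result printed above
    sample = {
        "text": "壓歲書",
        "sentences": [{"text": "壓歲書", "start": 0, "end": 8.39}],
    }
    for line in format_sentences(sample):
        print(line)
```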

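Conclusion

The framework supports several speech recognition models, including deepspeech2, conformer, squeezeformer, and efficient_conformer. Each model supports both streaming and non-streaming recognition, along with several decoders: ctc_greedy_search, ctc_prefix_beam_search, attention_rescoring, and ctc_beam_search. There are more features to explore in the repository.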
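
To give a feel for the simplest of these decoders, here is a minimal pure-Python sketch of CTC greedy search: pick the most likely token per frame, merge consecutive repeats, then drop blanks. This is the textbook procedure, not MASR's implementation; the function name, the blank index 0, and the toy probabilities are assumptions for illustration.

```python
def ctc_greedy_decode(frame_probs, blank_id=0):
    """Greedy CTC decoding over a list of per-frame probability lists.

    Returns the token IDs left after collapsing repeats and removing blanks.
    """
    decoded = []
    previous = None
    for probs in frame_probs:
        # Index of the most probable token in this frame
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit only when the token changes and is not the blank symbol
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


if __name__ == "__main__":
    # Toy example: 3 tokens (0 = blank) over 6 frames
    frames = [
        [0.1, 0.8, 0.1],    # token 1
        [0.2, 0.7, 0.1],    # token 1 (repeat, merged)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.8, 0.1],    # token 1 again (kept, separated by a blank)
        [0.1, 0.2, 0.7],    # token 2
        [0.6, 0.2, 0.2],    # blank
    ]
    print(ctc_greedy_decode(frames))
```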
