Improving Automatic Speech Recognition Systems Using the Conformer Model and Feature Reduction Techniques
Keywords:
Deep learning, Automatic speech recognition (ASR), Neural networks, Word error rate (WER), Feature reduction, Fbank, Conformer
Abstract
Automatic Speech Recognition (ASR) is a prominent application area of artificial intelligence and deep learning, with broad use in digital assistants, speech-to-text conversion, and voice interaction on smart devices. This study presents a methodological framework for improving the efficiency and accuracy of ASR systems, built on advanced audio processing techniques and a hybrid modeling architecture.
The proposed system uses filter bank energies (Fbank) as an alternative to traditional cepstral coefficients such as MFCCs, because they provide detailed spectral information that helps improve the recognition of audio patterns. The SpecAugment technique, which is based on time and frequency masking, was also employed to increase the diversity of the training data and to improve the model's ability to generalize across diverse audio environments. The model itself is built on the Conformer architecture, a hybrid that combines Convolutional Neural Networks (CNNs) and Transformers, which enables it to capture local and global acoustic patterns over time more efficiently. The proposed system also reduces the number of acoustic features to only 53, which lowers computational complexity and resource consumption without negatively impacting performance.
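The masking idea behind SpecAugment can be illustrated with a minimal sketch. This is not the study's implementation: the function name `spec_augment`, the mask widths, and the choice of zero as the fill value are all assumptions made for illustration, and the time-warping step of the original SpecAugment recipe is omitted. The only value taken from the text is the feature dimension of 53.

```python
import random


def spec_augment(features, max_freq_width=8, max_time_width=20, rng=None):
    """Return a masked copy of `features`, a list of frames x filter-bank bins.

    Hypothetical helper: applies one frequency mask and one time mask.
    The default widths are illustrative assumptions, not values from the study.
    """
    if rng is None:
        rng = random.Random()
    num_frames = len(features)
    num_bins = len(features[0])
    masked = [row[:] for row in features]  # copy so the input is not modified

    # Frequency mask: zero a band of consecutive channels in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for row in masked:
        for b in range(f_start, f_start + f_width):
            row[b] = 0.0

    # Time mask: zero a run of consecutive frames across all channels.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * num_bins

    return masked


# Example: 100 frames of 53-dimensional features, as in the reduced feature set.
features = [[1.0] * 53 for _ in range(100)]
augmented = spec_augment(features, rng=random.Random(0))
```

Because a new random mask is drawn on every call, the same utterance yields a different masked input each epoch, which is how the technique increases the diversity of the training data.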
The experimental results show that the model is both efficient and accurate, achieving a Word Error Rate (WER) of approximately 19% and a validation loss of 0.21. These results indicate that the proposed system can handle the challenges of real-world audio data and that it represents a promising step toward improving the performance of automatic speech recognition systems. The study also paves the way for future research on improving model architectures and incorporating new learning techniques.
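For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognized text, divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the study's own scoring tool and text normalization are not described here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenizes on whitespace only, with no text
    normalization (an assumption, not the study's scoring setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of approximately 19% means roughly 19 word-level errors per 100 reference words.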