---
library_name: transformers
language:
- ar
license: mit
pipeline_tag: automatic-speech-recognition
tags: []
---
# DeepAr

## Model Description

DeepAr is a state-of-the-art Arabic Automatic Speech Recognition (ASR) model based on the Whisper large-v3-turbo architecture. It is our latest and most advanced version, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset.

**Key Features:**
- **High-fidelity transcription**: Transcribes exactly what is pronounced, preserving the authenticity of speech patterns
- **Speech improvement tool**: Designed to help users identify and correct pronunciation patterns
- **Superior performance**: Outperforms many existing Arabic ASR models based on Whisper and its variants
- **Arabic with Tashkil**: Produces fully diacritized Arabic text output
## What Makes DeepAr Different

Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes **exactly what is pronounced**. This approach makes it particularly valuable for:

- **Speech therapy and improvement**: Identifies pronunciation patterns and deviations
- **Language learning**: Helps learners understand their actual pronunciation vs. intended speech
- **Linguistic research**: Captures authentic speech patterns for analysis
- **Pronunciation assessment**: Provides detailed feedback on spoken Arabic
## Model Details

- **Base Architecture**: whisper-large-v3-turbo
- **Language**: Arabic (with Tashkil/diacritics)
- **Task**: High-fidelity Automatic Speech Recognition
- **Training Data**: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset
- **Status**: Production-ready, latest version
## Performance

DeepAr demonstrates superior performance compared to many Arabic ASR models built on Whisper and its variants, particularly excelling in the areas below (an evaluation sketch for your own recordings follows the list):
- Pronunciation accuracy detection
- Diacritic prediction
- Handling of Arabic speech variations
- Authentic speech pattern recognition
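
No benchmark figures are published here, so the claims above are qualitative. To verify performance on your own recordings, a common approach is to compute word and character error rates against reference transcripts. Below is a minimal sketch using the third-party `jiwer` package (an extra dependency, not required by the model itself); the example strings are placeholders, and note that diacritics count toward both metrics unless you strip them first.

```python
# pip install jiwer
import jiwer

# Ground-truth transcripts and the model's outputs for the same clips
# (see the Usage section below for how to produce transcriptions)
references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
hypotheses = ["النص المرجعي الاول", "النص المرجعي الثاني"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # word error rate
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")  # character error rate
```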

## Intended Use

This model is ideal for:
- Speech therapy and pronunciation correction applications
- Arabic language learning platforms
- Linguistic research and analysis
- Educational tools for speech improvement
- Applications requiring authentic speech transcription
- Quality assessment of spoken Arabic
## Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Quick Start
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load model and processor
processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

# Load and preprocess audio
audio_path = "path_to_your_arabic_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Convert to mono if the recording has more than one channel
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Process audio into log-mel input features
input_features = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="ar")

# Decode transcription (exactly as pronounced)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Pronounced as: {transcription}")
```
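
For quick experiments, the same checkpoint can also be used through the high-level `pipeline` API, which handles resampling and decoding internally. This is a minimal sketch; decoding audio files this way typically requires `ffmpeg` or `soundfile` to be available.

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="CUAIStudents/DeepAr",
    chunk_length_s=30,  # chunking enables transcription of recordings longer than 30 s
)

result = asr("path_to_your_arabic_audio.wav", generate_kwargs={"language": "ar"})
print(result["text"])
```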

### Speech Analysis Example

```python
def analyze_pronunciation(audio_path, target_text=None):
    """
    Analyze pronunciation and compare with target text if provided
    """
    waveform, sample_rate = torchaudio.load(audio_path)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    input_features = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_features

    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")

    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    print(f"Actual pronunciation: {actual_pronunciation}")

    if target_text:
        print(f"Target text: {target_text}")
        print("Analysis: Compare the differences for speech improvement")

    return actual_pronunciation

# Example usage
pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
```
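
The helper above prints the target text but leaves the comparison to the reader. To quantify how far the pronunciation deviates from the target, a character-level diff can be layered on top of it. This is a minimal sketch using Python's standard `difflib`; the character-level granularity and the `compare_pronunciation` helper are illustrative choices, not part of the model.

```python
import difflib

def compare_pronunciation(actual: str, target: str):
    """Show where the transcribed pronunciation differs from the target text."""
    matcher = difflib.SequenceMatcher(None, target, actual)
    print(f"Similarity: {matcher.ratio():.2%}")
    for op, t1, t2, a1, a2 in matcher.get_opcodes():
        if op != "equal":
            print(f"{op}: target[{t1}:{t2}]={target[t1:t2]!r} -> actual[{a1}:{a2}]={actual[a1:a2]!r}")

# Example usage, building on analyze_pronunciation above
target = "النص المطلوب قراءته"
actual = analyze_pronunciation("student_reading.wav", target)
compare_pronunciation(actual, target)
```

Word-level or diacritic-aware comparisons may be more appropriate depending on whether you are assessing whole words or individual vowel marks.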

### Batch Processing for Speech Assessment

```python
def assess_multiple_recordings(audio_files, target_texts=None):
    """
    Process multiple recordings for comprehensive speech assessment
    """
    results = []

    for i, audio_file in enumerate(audio_files):
        waveform, sample_rate = torchaudio.load(audio_file)

        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)

        input_features = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_features

        with torch.no_grad():
            predicted_ids = model.generate(input_features, language="ar")

        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        result = {
            'file': audio_file,
            'pronunciation': pronunciation,
            'target': target_texts[i] if target_texts else None
        }
        results.append(result)

        print(f"File {i+1}: {pronunciation}")

    return results

# Example usage
audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
assessment_results = assess_multiple_recordings(audio_files, target_texts)
```
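
The function above runs one `generate` call per file. Because the Whisper feature extractor pads every clip to a fixed 30-second window, several recordings can also be stacked into a single batched call, which is usually faster on a GPU. A minimal sketch, assuming the `model` and `processor` from the Quick Start are already loaded and each clip is at most 30 seconds long:

```python
def transcribe_batch(audio_files, batch_size=4):
    """Transcribe recordings in batched generate() calls (illustrative sketch)."""
    transcriptions = []
    for start in range(0, len(audio_files), batch_size):
        arrays = []
        for path in audio_files[start:start + batch_size]:
            waveform, sr = torchaudio.load(path)
            if sr != 16000:
                waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
            arrays.append(waveform.mean(dim=0).numpy())  # mono, 16 kHz
        # A list of arrays is padded into a single (batch, n_mels, frames) feature tensor
        inputs = processor(arrays, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            predicted_ids = model.generate(inputs.input_features, language="ar")
        transcriptions.extend(processor.batch_decode(predicted_ids, skip_special_tokens=True))
    return transcriptions
```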

## Training Data

This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, covering the full range of available Arabic speech recordings paired with high-quality, diacritized transcriptions.
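
To inspect the training data yourself, the dataset can be loaded with the `datasets` library. A minimal sketch; the `train` split name is an assumption, so check the dataset card for the actual splits and column names.

```python
from datasets import load_dataset

# Stream the dataset to avoid downloading everything up front
ds = load_dataset("CUAIStudents/Ar-ASR", split="train", streaming=True)  # split name assumed

# Look at the first example to see the available columns (audio, transcription, etc.)
first_example = next(iter(ds))
print(first_example.keys())
```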

## Model Advantages

- **Authentic transcription**: Captures exactly what is spoken, not what should be spoken
- **High accuracy**: Superior performance compared to similar Whisper-based Arabic models
- **Comprehensive training**: Utilizes the complete dataset for optimal coverage
- **Practical applications**: Specifically designed for speech improvement and assessment
- **Diacritic accuracy**: Excellent performance in Arabic diacritization

## Limitations

- **MSA focus**: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations

## License

This model is released under the MIT License.

```
MIT License

Copyright (c) 2024 CUAIStudents

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```