| --- |
| license: mit |
| tags: |
| - bert |
| - morphological-analysis |
| - kyrgyz |
| - nlp |
| - pos-tagging |
| - low-resource-languages |
| - token-classification |
| language: |
| - ky |
| pipeline_tag: token-classification |
| --- |
| |
| # Kyrgyz Morphological Analysis: BERT |
|
|
| <p align="center"> |
| <img src="image_2023-05-13_16-58-05.png" alt="Morphological analysis example" width="600"/> |
| </p> |
|
|
| ## Model Description |
|
|
| A **BERT-based morphological analyzer** for the **Kyrgyz language**, a low-resource Turkic language spoken by roughly 5 million people. The model performs morphological tagging: it predicts grammatical features (POS tags, case, number, tense, etc.) for each token in a sentence. |
|
|
| Kyrgyz is an agglutinative language with rich morphology, making morphological analysis particularly challenging and valuable for downstream NLP tasks. |
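To make the tagging task concrete, the sketch below shows one possible way to represent a per-token analysis. The word `китептеримде` ("in my books") and its segmentation are an illustrative example chosen here, and the feature names follow Universal Dependencies conventions as an assumption; the model's actual tag inventory is the one defined in `TAG.docx`.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAnalysis:
    """One token with its morphological analysis (illustrative structure)."""
    surface: str
    morphemes: list[str]
    pos: str
    features: dict[str, str] = field(default_factory=dict)


# Hypothetical example: "китептеримде" = "in my books".
# Feature names are UD-style placeholders, not the tag set from TAG.docx.
example = TokenAnalysis(
    surface="китептеримде",
    morphemes=["китеп", "тер", "им", "де"],  # book + plural + 1sg possessive + locative
    pos="NOUN",
    features={"Number": "Plur", "Person[psor]": "1", "Number[psor]": "Sing", "Case": "Loc"},
)

# In an agglutinative language the suffixes attach in sequence, so the
# morphemes concatenate back to the surface form.
assert "".join(example.morphemes) == example.surface
print(example.pos, example.features)
```

A single surface token carries several grammatical features at once, which is why the tag space for Kyrgyz is much larger than a plain POS inventory.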
|
|
| ## Performance |
|
|
| | Model | Accuracy | |
| |-------|----------| |
| | **BERT (fine-tuned)** | **~80%** | |
| Logistic Regression (baseline) | not reported |
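The card does not state how accuracy is computed. A common choice for tagging tasks is token-level accuracy, the share of tokens whose predicted tag exactly matches the gold tag; the sketch below assumes that definition. The function name and the toy tags are hypothetical.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of tags.
    """
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


# Toy data with placeholder tags: 4 of 5 tokens match.
gold = [["NOUN", "ADJ", "NOUN"], ["PRON", "VERB"]]
pred = [["NOUN", "ADJ", "NOUN"], ["PRON", "NOUN"]]
print(token_accuracy(gold, pred))  # 0.8
```

If a tag bundles several features (for example POS plus case plus number), exact-match accuracy counts a token as wrong when any single feature is wrong, so it is a strict measure.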
|
|
| <!-- 🚧 TODO: Add baseline accuracy if available --> |
|
|
| ## Intended Use |
|
|
| | Use Case | Description | |
| |----------|-------------| |
| | **Kyrgyz NLP pipeline** | Morphological preprocessing for machine translation, text analysis | |
| | **Linguistic research** | Studying Kyrgyz grammar and morphological patterns | |
| | **Education** | Teaching Kyrgyz morphology with automated analysis | |
| | **Downstream tasks** | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz | |
|
|
| ## Training Details |
|
|
| ### Dataset |
|
|
| - **Format:** CSV with morphological annotations |
| - **Train set:** `train_fixed.csv` |
| - **Test set:** `test_fixed.csv` |
| - **Tag set:** Defined in `TAG.docx` (morphological tag inventory) |
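The column layout of the CSV files is not documented here. As a minimal sketch, assuming one token per row with hypothetical columns `sentence_id`, `token`, and `tag`, the data could be grouped into sentences as follows. The inline sample stands in for `train_fixed.csv`, and its tags are placeholders.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for train_fixed.csv; column names and tags are assumptions.
SAMPLE_CSV = """sentence_id,token,tag
0,Кыргызстан,NOUN
0,кооз,ADJ
0,өлкө,NOUN
1,китептеримде,NOUN
"""


def read_sentences(handle):
    """Group (token, tag) rows into sentences keyed by sentence_id."""
    sentences = defaultdict(list)
    for row in csv.DictReader(handle):
        sentences[row["sentence_id"]].append((row["token"], row["tag"]))
    # Dicts keep insertion order, so sentences come back in first-seen order.
    return list(sentences.values())


sentences = read_sentences(io.StringIO(SAMPLE_CSV))
print(len(sentences), sentences[0])
```

For the real files, the same function would take `open("train_fixed.csv", newline="", encoding="utf-8")` in place of the `StringIO` object, with the column names adjusted to the actual header.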
|
|
| ### Architecture |
|
|
| - **Base model:** BERT (fine-tuned for token classification) |
| - **Custom variant:** `bert_model_variant.py` |
| - **Baseline:** Logistic Regression (`logistic_regression.ipynb`) |
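BERT tokenizers split words into subword pieces, so word-level tags have to be aligned to subword positions before fine-tuning for token classification. One common convention, sketched below, labels the first piece of each word and masks the remaining pieces with `-100`, the index that PyTorch's cross-entropy loss ignores by default. Whether `bert_model_variant.py` follows this convention is not stated; the toy splitter and the tag ids are hypothetical stand-ins for a real tokenizer and the `TAG.docx` inventory.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_subword_split(word, max_len=4):
    """Stand-in for a WordPiece tokenizer: chop a non-empty word into fixed-size pieces."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with "##", as WordPiece does.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, tag_ids):
    """Expand word-level tag ids to subword level, masking continuation pieces."""
    subwords, labels = [], []
    for word, tag_id in zip(words, tag_ids):
        pieces = toy_subword_split(word)
        subwords.extend(pieces)
        labels.extend([tag_id] + [IGNORE_INDEX] * (len(pieces) - 1))
    return subwords, labels


words = ["кооз", "өлкө", "китептеримде"]
tag_ids = [1, 0, 0]  # hypothetical ids, e.g. 0 = NOUN, 1 = ADJ
subwords, labels = align_labels(words, tag_ids)
print(list(zip(subwords, labels)))
```

Masking continuation pieces keeps the loss and the reported accuracy at word level, so long, heavily suffixed words are not counted several times.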
|
|
| ### Framework |
|
|
| - Python 3.10+ |
| - PyTorch / Transformers (HuggingFace) |
|
|
| ## Repository Structure |
|
|
| ``` |
| ├── bert_model_variant.py        # Custom BERT model architecture |
| ├── train.py                     # Training script |
| ├── dev.py                       # Evaluation script |
| ├── dev.ipynb                    # Development notebook |
| ├── logistic_regression.ipynb    # Baseline model |
| ├── train_fixed.csv              # Training data |
| ├── test_fixed.csv               # Test data |
| └── TAG.docx                     # Morphological tag definitions |
| ``` |
|
|
| ## How to Use |
|
|
| ```python |
| # Load and run inference |
| from bert_model_variant import MorphAnalyzer # adjust import as needed |
| |
| # Example: analyze Kyrgyz text ("Kyrgyzstan is a beautiful country") |
| text = "Кыргызстан — кооз өлкө" |
| # See train.py and dev.py for full inference pipeline |
| ``` |
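As a complement to the snippet above, and assuming the model emits one label id per subword piece, word-level tags can be recovered by keeping the prediction of each word's first piece. The `id2label` map, the subword split, and the predicted ids below are made up for illustration; real values would come from the model and `TAG.docx`.

```python
def first_piece_tags(subwords, predicted_ids, id2label):
    """Keep the predicted tag of each word's first subword piece.

    Continuation pieces are those starting with "##" (WordPiece convention).
    """
    words, tags = [], []
    for piece, label_id in zip(subwords, predicted_ids):
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # glue the continuation onto the current word
        else:
            words.append(piece)
            tags.append(id2label[label_id])
    return list(zip(words, tags))


id2label = {0: "NOUN", 1: "ADJ"}  # hypothetical label map
subwords = ["кооз", "өлкө", "ките", "##птер", "##имде"]
predicted_ids = [1, 0, 0, 1, 0]  # ids on continuation pieces are ignored
print(first_piece_tags(subwords, predicted_ids, id2label))
```

Taking the first piece is only one option; averaging logits over all pieces of a word is another, and the scripts in this repository may do either.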
|
|
| <!-- 🚧 TODO: Add a more complete inference example --> |
|
|
| ## Why This Matters |
|
|
| Kyrgyz is an **underrepresented language** in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to: |
|
|
| - Building foundational NLP tools for the Kyrgyz language |
| - Enabling more complex downstream applications (MT, QA, summarization) |
| - Preserving and digitizing Kyrgyz linguistic knowledge |
|
|
| ## Limitations |
|
|
| - Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged |
| - Performance may vary across different text domains and registers |
| - Limited to the morphological tag set defined in `TAG.docx` |
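To put the first point in perspective: if tagging errors were independent across tokens (a simplifying assumption, not a measured property of this model), the chance that every token in an n-token sentence is tagged correctly would be 0.8 raised to the power n. The helper name below is hypothetical.

```python
def all_correct_probability(token_accuracy, n_tokens):
    """Probability that all n tokens are tagged correctly, assuming independent errors."""
    return token_accuracy ** n_tokens


for n in (5, 10, 20):
    print(n, round(all_correct_probability(0.8, n), 3))
# 5 0.328
# 10 0.107
# 20 0.012
```

Under that assumption only about one in ten 10-token sentences would come out fully correct, which matters for downstream tasks that consume whole-sentence analyses.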
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{kyrgyz_morph_2023, |
| author = {Zarina}, |
| title = {BERT-based Morphological Analyzer for Kyrgyz Language}, |
| year = {2023}, |
| url = {https://huggingface.co/Zarinaaa/morphological_analysis} |
| } |
| ``` |
|
|
| ## Author |
|
|
| **Zarina**, ML Engineer specializing in NLP and Speech Technologies for low-resource languages. |
|
|
| - 🤗 [HuggingFace](https://huggingface.co/Zarinaaa) |
| - 💼 [LinkedIn](https://linkedin.com/in/YOUR_LINKEDIN) |