Zeineuski: Basque Dialect Identification (fastText)
A hierarchical fastText model for identifying Basque (Euskara) dialects from text. It classifies input into 6 categories: 5 regional dialects (euskalkiak) plus standard Batua.
Dialect Taxonomy
| Class | Dialect | Region |
|---|---|---|
| batua | Batua (Standard Basque) | Euskal Herria-wide |
| western | Mendebaldekoa / Bizkaiera | Bizkaia, western Gipuzkoa |
| central | Erdialdekoa / Gipuzkera | Gipuzkoa |
| navarrese | Nafarrera | Navarre |
| nav-lab | Nafar-Lapurtera | Lapurdi, Low Navarre |
| souletin | Zuberera | Zuberoa |
Architecture
Hierarchical 2-step classifier:
- Binary model (hier_binary_*.bin): batua vs dialectal
- Dialect model (hier_dialect_*.bin): 5-class euskalkiak classification
The binary step eliminates batua-vs-dialect confusion: the dialect model never sees batua samples and only needs to distinguish among the 5 regional varieties.
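The split described above can be pictured with a small data-preparation sketch. It is illustrative only, not the authors' pipeline: it assumes training examples arrive as (class, text) pairs using the class names from the taxonomy table, and both the helper name `split_for_hierarchy` and the `dialectal` label for the binary step are hypothetical.

```python
from typing import Iterable, List, Tuple

# Class names from the taxonomy table; "batua" is handled separately.
DIALECTS = {"western", "central", "navarrese", "nav-lab", "souletin"}


def split_for_hierarchy(
    samples: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Build fastText-format lines for the binary and the dialect training sets."""
    binary_lines: List[str] = []
    dialect_lines: List[str] = []
    for label, text in samples:
        # fastText expects one example per line, so collapse all whitespace.
        text = " ".join(text.split())
        if label == "batua":
            binary_lines.append(f"__label__batua {text}")
        elif label in DIALECTS:
            # "dialectal" is an assumed label name for the binary step.
            binary_lines.append(f"__label__dialectal {text}")
            # Only dialectal samples reach the second model's training set.
            dialect_lines.append(f"__label__{label} {text}")
        else:
            raise ValueError(f"unknown class: {label}")
    return binary_lines, dialect_lines
```

Every sample feeds the binary set, while batua samples are filtered out of the dialect set, which is what keeps the second model a pure 5-class problem.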
Performance
| Metric | Value |
|---|---|
| XNLI 3c cross-domain | 96.73% (5-class ceiling: 96.85%) |
| 6-class test accuracy | 97.83% |
| Batua F1 | 0.962 |
| Western F1 | 0.976 |
| Central F1 | 0.958 |
| Nav-Lab F1 | 0.968 |
| EuskanolDS (code-switched tweets) | 94.2% high-confidence Batua detection |
Usage
import fasttext

# Load models
binary_model = fasttext.load_model("hier_binary_final.bin")
dialect_model = fasttext.load_model("hier_dialect_final.bin")

def predict_dialect(text: str, threshold: float = 0.7) -> dict:
    """Two-step hierarchical prediction."""
    labels, probs = binary_model.predict(text, k=1)
    if labels[0] == "__label__batua":
        return {"dialect": "batua", "confidence": float(probs[0])}
    labels, probs = dialect_model.predict(text, k=1)
    dialect = labels[0].replace("__label__", "")
    conf = float(probs[0])
    if conf < threshold:
        return {"dialect": "uncertain", "confidence": conf}
    return {"dialect": dialect, "confidence": conf}

# Example
print(predict_dialect("Gaur goizean goiz jaiki naiz"))
# → {"dialect": "central", "confidence": 0.92}
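If the model files are not already on disk, they could be fetched from the Hub first. The following is a hedged sketch: `hf_hub_download` and `fasttext.load_model` are real calls, but the file names are an assumption carried over from the local-loading example and have not been checked against the repository contents; `load_from_hub` is a hypothetical helper.

```python
REPO_ID = "itzune/zeineuski"
# Assumed file names, taken from the local-loading example; verify them
# against the repository's file listing before relying on this.
BINARY_FILE = "hier_binary_final.bin"
DIALECT_FILE = "hier_dialect_final.bin"


def load_from_hub(repo_id: str = REPO_ID):
    """Download both model files (cached by huggingface_hub) and load them."""
    # Imported lazily so the module can be inspected without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    binary_path = hf_hub_download(repo_id=repo_id, filename=BINARY_FILE)
    dialect_path = hf_hub_download(repo_id=repo_id, filename=DIALECT_FILE)
    return fasttext.load_model(binary_path), fasttext.load_model(dialect_path)
```

The two returned models can then be used in place of `binary_model` and `dialect_model` in the prediction function.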
Training Details
- Architecture: fastText (Facebook Research), skipgram with character n-grams
- Training data: 29,977 sentences (5 euskalkiak from Klasikoak + 15K EITB Batua news)
- Binary model: lr=3.0, dim=100, epoch=50, wordNgrams=2, minn=3, maxn=6
- Dialect model: lr=0.2, dim=100, epoch=150, wordNgrams=2, minn=3, maxn=6
- Training time: ~57s total on CPU (4 threads)
- Optimization: 33 autoresearch experiments across 2 segments
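For readers who want to reproduce a comparable setup, the listed hyperparameters map directly onto keyword arguments of fastText's Python API. The sketch below assumes the supervised classification mode (`fasttext.train_supervised`) and training files in the usual `__label__x text` format; the function name, file paths, and that choice of training mode are assumptions, not details confirmed by this card.

```python
# Hyperparameters copied from the list above; everything else is an assumption.
BINARY_PARAMS = dict(lr=3.0, dim=100, epoch=50, wordNgrams=2, minn=3, maxn=6)
DIALECT_PARAMS = dict(lr=0.2, dim=100, epoch=150, wordNgrams=2, minn=3, maxn=6)


def train_hierarchy(binary_path: str, dialect_path: str, threads: int = 4):
    """Train both steps from fastText-format files (one example per line)."""
    # Imported lazily so the parameter dicts can be inspected without fastText.
    import fasttext

    binary_model = fasttext.train_supervised(
        input=binary_path, thread=threads, **BINARY_PARAMS
    )
    dialect_model = fasttext.train_supervised(
        input=dialect_path, thread=threads, **DIALECT_PARAMS
    )
    return binary_model, dialect_model
```

The two steps differ only in learning rate and epoch count; the embedding size and n-gram settings are shared.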
Limitations
- Navarrese and Souletin are not in the 6-class test set (evaluated via validation only)
- Code-switched text (Basque/Spanish) may produce low-confidence predictions
- Transition zone dialects (Debagoiena, Bidasoa) may have overlapping predictions
- Recommended minimum text length: ≥20 characters for reliable predictions
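The length recommendation can be enforced with a thin wrapper around whatever prediction function is in use. This is a minimal sketch: `safe_predict`, the `too_short` reason code, and the zero confidence returned for short inputs are all hypothetical choices. It also collapses newlines, since fastText's Python `predict` rejects strings that contain them.

```python
from typing import Callable, Dict

MIN_CHARS = 20  # recommended minimum from the limitations above


def safe_predict(
    text: str,
    predict: Callable[[str], Dict],
    min_chars: int = MIN_CHARS,
) -> Dict:
    """Normalise whitespace and skip inputs too short to classify reliably."""
    # Collapse newlines and repeated spaces into single spaces.
    cleaned = " ".join(text.split())
    if len(cleaned) < min_chars:
        # Hypothetical convention: report short inputs as uncertain.
        return {"dialect": "uncertain", "confidence": 0.0, "reason": "too_short"}
    return predict(cleaned)
```

Passing the predictor in as an argument keeps the guard independent of how the models were loaded, for example `safe_predict(text, predict_dialect)`.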
Citation
If you use this model, please cite:
@software{zeineuski2026,
  author = {Zeineuski Team},
  title = {Zeineuski: Basque Dialect Identification},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/itzune/zeineuski}
}
License
MIT; see the LICENSE file.