Zeineuski: Basque Dialect Identification (fastText)

A hierarchical fastText model for identifying Basque (Euskara) dialects from text. It classifies text into 6 categories: 5 regional dialects (euskalkiak) plus standard Batua.

Dialect Taxonomy

Class       Dialect                      Region
batua       Batua (Standard Basque)      Euskal Herria-wide
western     Mendebaldekoa / Bizkaiera    Bizkaia, western Gipuzkoa
central     Erdialdekoa / Gipuzkera      Gipuzkoa
navarrese   Nafarrera                    Navarre
nav-lab     Nafar-Lapurtera              Lapurdi, Low Navarre
souletin    Zuberera                     Zuberoa

Architecture

Hierarchical 2-step classifier:

  1. Binary model (hier_binary_*.bin): batua vs dialectal
  2. Dialect model (hier_dialect_*.bin): 5-class euskalkiak classification

The binary step eliminates batua-vs-dialect confusion: the dialect model never sees batua samples and only needs to distinguish among the 5 regional varieties.

Performance

Metric                               Value
XNLI 3c cross-domain                 96.73% (5-class ceiling: 96.85%)
6-class test accuracy                97.83%
Batua F1                             0.962
Western F1                           0.976
Central F1                           0.958
Nav-Lab F1                           0.968
EuskanolDS (code-switched tweets)    94.2% high-confidence Batua detection
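
For reference, a per-class F1 score is the harmonic mean of precision and recall for that class. The sketch below shows how such a score can be computed from gold and predicted labels. The function name and the toy labels are illustrative assumptions; this is not the project's evaluation code.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Return {label: F1} from parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the gold label was different
            fn[g] += 1  # the gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy data (hypothetical, not drawn from the test set)
gold = ["batua", "batua", "western", "central"]
pred = ["batua", "western", "western", "central"]
scores = per_class_f1(gold, pred)
```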

Usage

import fasttext

# Load models
binary_model = fasttext.load_model("hier_binary_final.bin")
dialect_model = fasttext.load_model("hier_dialect_final.bin")

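def predict_dialect(text: str, threshold: float = 0.7) -> dict:
    """Two-step hierarchical prediction.

    Returns the predicted class and its confidence. Dialect predictions
    below `threshold` are reported as "uncertain".
    """
    # fastText's predict() rejects input that contains newlines
    text = text.replace("\n", " ")

    # Step 1: batua vs. dialectal
    labels, probs = binary_model.predict(text, k=1)
    if labels[0] == "__label__batua":
        return {"dialect": "batua", "confidence": float(probs[0])}

    # Step 2: choose among the 5 regional dialects
    labels, probs = dialect_model.predict(text, k=1)
    dialect = labels[0].replace("__label__", "")
    conf = float(probs[0])

    if conf < threshold:
        return {"dialect": "uncertain", "confidence": conf}
    return {"dialect": dialect, "confidence": conf}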

# Example
print(predict_dialect("Gaur goizean goiz jaiki naiz"))
# → {"dialect": "central", "confidence": 0.92}
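
When the top prediction is uncertain, it can help to inspect several candidates. fastText's `predict` accepts `k` and returns parallel sequences of labels and probabilities. The helper below is a hypothetical addition, not part of the released code, that turns such output into a ranked list. The probabilities in the example are made up for illustration.

```python
def rank_predictions(labels, probs, prefix="__label__"):
    """Pair labels with probabilities, strip the label prefix, sort by probability."""
    pairs = []
    for label, prob in zip(labels, probs):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    # Highest probability first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Illustrative values shaped like the output of model.predict(text, k=3)
example_labels = ("__label__western", "__label__central", "__label__navarrese")
example_probs = (0.40, 0.55, 0.05)
ranked = rank_predictions(example_labels, example_probs)
```

With the models loaded as above, `rank_predictions(*dialect_model.predict(text, k=3))` would give the three most likely dialects for a sentence.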

Training Details

  • Architecture: fastText (Facebook Research), skipgram with character n-grams
  • Training data: 29,977 sentences (5 euskalkiak from Klasikoak + 15K EITB Batua news)
  • Binary model: lr=3.0, dim=100, epoch=50, wordNgrams=2, minn=3, maxn=6
  • Dialect model: lr=0.2, dim=100, epoch=150, wordNgrams=2, minn=3, maxn=6
  • Training time: ~57s total on CPU (4 threads)
  • Optimization: 33 autoresearch experiments across 2 segments
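
The hyperparameters listed above map directly onto keyword arguments of `fasttext.train_supervised`. The sketch below shows one way the two models might be trained with those settings. The file paths, the helper names, and the assumption that the training files use fastText's `__label__<class> <text>` line format with the default loss are illustrative and are not confirmed by the source.

```python
# Hyperparameters as listed above; thread=4 follows the reported CPU setup
BINARY_PARAMS = dict(lr=3.0, dim=100, epoch=50, wordNgrams=2, minn=3, maxn=6, thread=4)
DIALECT_PARAMS = dict(lr=0.2, dim=100, epoch=150, wordNgrams=2, minn=3, maxn=6, thread=4)

def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as a single line in fastText's supervised format."""
    # Collapse all whitespace (including newlines) so the example stays on one line
    return f"__label__{label} {' '.join(text.split())}"

def train(train_path: str, params: dict, out_path: str):
    """Train a supervised fastText model and save it (paths are hypothetical)."""
    import fasttext  # imported lazily so the helpers above work without it
    model = fasttext.train_supervised(input=train_path, **params)
    model.save_model(out_path)
    return model

# Hypothetical usage:
# train("binary.train.txt", BINARY_PARAMS, "hier_binary_final.bin")
# train("dialect.train.txt", DIALECT_PARAMS, "hier_dialect_final.bin")
```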

Limitations

  • Navarrese and Souletin are not in the 6-class test set (evaluated via validation only)
  • Code-switched text (Basque/Spanish) may produce low-confidence predictions
  • Transition zone dialects (Debagoiena, Bidasoa) may have overlapping predictions
  • Recommended minimum text length: ≥20 characters for reliable predictions
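
One way to act on the length recommendation is to screen inputs before classification. The sketch below is a hypothetical pre-check; the function name and the "too_short" reason are assumptions. It only counts characters after whitespace normalization and does not detect code-switching or transition-zone varieties.

```python
MIN_CHARS = 20  # recommended minimum length from the list above

def check_input(text: str, min_chars: int = MIN_CHARS) -> dict:
    """Normalize whitespace and flag inputs shorter than the recommended minimum."""
    normalized = " ".join(text.split())
    if len(normalized) < min_chars:
        return {"ok": False, "reason": "too_short", "text": normalized}
    return {"ok": True, "reason": None, "text": normalized}

short = check_input("Kaixo")
long_enough = check_input("Gaur goizean goiz jaiki naiz")
```

A caller could skip classification, or report the result as uncertain, whenever `ok` is false.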

Citation

If you use this model, please cite:

@software{zeineuski2026,
  author = {Zeineuski Team},
  title = {Zeineuski: Basque Dialect Identification},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/itzune/zeineuski}
}

License

MIT (see LICENSE file).
