Purpose:

This model is a query classifier for the Arabic language. It can be used on its own or within a Haystack pipeline. It returns 0 for a keyword query (a string of words) and 1 for a fully formed question.
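
As a rough illustration of how the 0/1 output could be read, here is a minimal sketch. It assumes the checkpoint loads as a TensorFlow sequence-classification model through the transformers library, and `model_id` is a placeholder you must fill in with the model's actual Hub name. The helper names `LABELS`, `label_from_logits`, and `classify` are hypothetical and not part of the model.

```python
# Meaning of the two labels, as described above.
LABELS = {0: "keyword query", 1: "question"}


def label_from_logits(logits):
    """Return the index of the largest logit (0 or 1 for this model)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def classify(text, model_id):
    """Classify one query string.

    Assumption: `model_id` names a checkpoint that loads with
    TFAutoModelForSequenceClassification. The import is local so the
    helpers above can be used without transformers installed.
    """
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = TFAutoModelForSequenceClassification.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="tf", truncation=True)
    logits = model(**inputs).logits[0].numpy().tolist()
    return label_from_logits(logits)
```

Calling `LABELS[classify(query, model_id)]` would then give a readable description of the predicted class, which a pipeline could use to route the query.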

It was built in three steps.

  1. Take the same Kaggle training data that Sharukh used, but only the 'dev.csv' file, which is more than sufficient, and later split it into new train, val, and test sets. Translate it into Arabic using the Seq2Seq translation model "facebook/m2m100_1.2B". The priority was to get syntactically correct translations, not necessarily semantically correct ones. For that reason, the words of each keyword query were translated individually and recombined into one string. The questions were translated as-is, and the results were sometimes a mix of Arabic and English (I think this is due to the m2m model's vocabulary size and tokenizer). About 28% of the training data had explicit question marks.

  2. Use the ARBERT model as the base, and fine-tune it on the above data.

  3. Distill the above model into a smaller one. I was not very successful in reducing the size significantly, although I reduced the number of hidden layers from 12 to 4.
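
The data preparation in step 1 can be sketched roughly as follows. This is not the original script: the function names, the split fractions, and the seed are assumptions made for illustration. The translator is passed in as a plain callable, so the word-by-word logic can be tried without downloading the 1.2B-parameter model; `make_m2m100_translator` shows one way to build such a callable with the transformers M2M100 classes.

```python
import random


def translate_keyword_query(query, translate):
    """Translate each word on its own, then rejoin the results with spaces."""
    return " ".join(translate(word) for word in query.split())


def translate_row(text, label, translate):
    """Label 0 (keyword query): word by word. Label 1 (question): as-is."""
    if label == 0:
        return translate_keyword_query(text, translate)
    return translate(text)


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and split into (train, val, test). Fractions are assumed."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


def make_m2m100_translator(model_name="facebook/m2m100_1.2B"):
    """Build an English-to-Arabic translate(text) callable with M2M100.

    The import is local so the helpers above work without transformers.
    """
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.src_lang = "en"

    def translate(text):
        encoded = tokenizer(text, return_tensors="pt")
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id("ar")
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate
```

Translating word by word keeps a keyword query looking like a bag of words in Arabic, whereas a whole-string translation might turn it into a grammatical phrase and blur the two classes.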
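
For step 3, the exact loss is not given here, so the following is only a sketch of a common knowledge-distillation objective: a weighted sum of the usual cross-entropy on the true label and a temperature-softened KL divergence between the teacher and student outputs. The values of `alpha` and `temperature` are hypothetical, and the code uses plain Python rather than a deep learning framework to keep the arithmetic visible.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * hard cross-entropy + (1 - alpha) * T^2 * KL(teacher || student).

    `alpha` and `temperature` are illustrative defaults, not the values
    used to train this model.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL divergence between softened distributions.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

The factor `temperature ** 2` is the usual correction that keeps the soft term's gradient scale comparable to the hard term's when the temperature changes.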

Results of testing the distilled model:

Measure      Score
accuracy     0.981
precision    0.983
recall       0.979
roc_auc      0.981
f1           0.981
matthews     0.962
mse          0.01876
brier        0.01876

(In this case the Brier score is the same as the MSE because there are only two labels.)
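
To make that equality concrete: with two labels, the Brier score is the mean squared difference between the predicted value for class 1 and the 0/1 label, which is exactly what MSE computes. The sketch below uses made-up labels and predictions, not the model's test data. It also shows that if the predictions are hard 0/1 labels (an assumption; how the scores above were computed is not stated), the value reduces to the error rate, 1 - accuracy.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences; on binary labels this is the Brier score."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels exactly."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


# Illustrative values only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
hard_pred = [1, 0, 1, 0, 0, 0, 1, 0]  # one mistake out of eight

brier_hard = mean_squared_error(y_true, hard_pred)
error_rate = 1 - accuracy(y_true, hard_pred)
print(brier_hard, error_rate)  # the two values coincide for hard predictions
```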

Thanks:

This model was inspired by this GitHub thread, in which building a query classifier model is discussed, and by Sharukh Khan's resulting English model based on DistilBERT.

For the model distillation, I owe thanks to the following sources:

Knowledge Distillation article by Phil Schmid

Articles by Remi Reboul:

Distillation Part 1

Distillation Part 2
