Model Overview

This model is a fine-tuned version of cmarkea/distilcamembert-base, adapted for binary text classification in French.

Model Type

  • Architecture: CamembertForSequenceClassification
  • Base Model: DistilCamemBERT
  • Hidden Layers: 6
  • Attention Heads: 12
  • Tokenizer: Based on CamemBERT's tokenizer
  • Vocab Size: 32,005 tokens
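
These figures can be checked by inspecting the model configuration (a minimal sketch; the repository path is the one used in the How to Use section below):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION")
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 32005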

Intended Use

This model is designed for classifying sentences as either travel-related or non-travel-related, with high accuracy on French datasets.

Example Use Case:

Given a sentence such as "Je veux aller de Paris à Lyon", the model returns:

  • Label: POSITIVE
  • Score: 0.9999655485153198

Given a sentence such as "Je veux acheter du pain", the model returns:

  • Label: NEGATIVE
  • Score: 0.9999724626541138
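
These outputs follow the Hugging Face text-classification pipeline format: a label string plus the softmax probability assigned to that label. A minimal sketch of consuming one such prediction (the dict values are copied from the first example above):

# A single pipeline prediction, using the values from the first example above
prediction = {"label": "POSITIVE", "score": 0.9999655485153198}

# POSITIVE means travel-related (see the Labels section below)
is_travel_related = prediction["label"] == "POSITIVE"
print(is_travel_related, round(prediction["score"], 4))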

Limitations:

  • Language: Optimized for French text; performance on other languages is not guaranteed.
  • Performance: Specifically trained for binary classification. Performance may degrade on multi-class or unrelated tasks.

Labels

The model uses the following class labels:

  • POSITIVE: Travel-related sentences
  • NEGATIVE: Non-travel-related sentences

Training Data

The model was fine-tuned on a proprietary French dataset, Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET, which contains thousands of labeled examples of travel and non-travel sentences.

Hyperparameters and Fine-Tuning:

  • Learning Rate: 5e-5
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Strategy: Epoch-based
  • Optimizer: AdamW
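
A minimal fine-tuning sketch consistent with these hyperparameters (the dataset split and column names are assumptions, as the training script itself is not published; AdamW is the Trainer default, so it needs no explicit setting):

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "cmarkea/distilcamembert-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Assumed schema: a "text" column and a binary "label" column
dataset = load_dataset("Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], max_length=128, padding="max_length", truncation=True),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,              # per the hyperparameters above
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",           # epoch-based evaluation
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],    # assumed split name
)
trainer.train()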

Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, reused as-is for this classification task. It performs subword tokenization with SentencePiece, an extension of Byte-Pair Encoding (BPE) that splits words into smaller units.

Tokenizer special settings:

  • Max Length: 128
  • Padding: Right-padded to 128 tokens
  • Truncation: Longest-first strategy, truncating tokens beyond 128.
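
These settings correspond to the following tokenizer call (a sketch; the repository path is the one from the How to Use section below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION")
encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # right-pad to 128 tokens
    truncation="longest_first",  # drop tokens beyond 128
)
print(len(encoded["input_ids"]))  # 128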

How to Use

You can load this model with Hugging Face’s transformers library, using the pipeline function to create a text classification pipeline as follows:

from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_path = "InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

# Classify a French sentence as travel-related (POSITIVE) or not (NEGATIVE)
sentence = "Je veux aller de Paris à Lyon"
result = classifier(sentence)
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9999...}]
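
Continuing from the snippet above, the pipeline also accepts a list of sentences, which is convenient for batch scoring:

sentences = ["Je veux aller de Paris à Lyon", "Je veux acheter du pain"]
print(classifier(sentences))  # one {'label': ..., 'score': ...} dict per input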

Limitations and Bias

While the model performs well on the training and test datasets, there are some known limitations:

  • Bias in Dataset: Performance may reflect the biases in the training data.
  • Generalization: Results may be biased towards specific named entities frequently seen in the training data (such as city names).

License

This model is released under the Apache 2.0 License.
