Model Overview
This model is a fine-tuned version of cmarkea/distilcamembert-base, adapted for binary text classification in French.
Model Type
- Architecture:
CamembertForSequenceClassification
- Base Model: DistilCamemBERT
- Number of Layers: 6 hidden layers, 12 attention heads
- Tokenizer: Based on CamemBERT's tokenizer
- Vocab Size: 32,005 tokens
Intended Use
This model classifies French sentences as either travel-related or non-travel-related.
Example Use Case:
Given a sentence such as "Je veux aller de Paris à Lyon", the model returns:
- Label: POSITIVE
- Score: 0.9999655485153198

Given a sentence such as "Je veux acheter du pain", the model returns:
- Label: NEGATIVE
- Score: 0.9999724626541138
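The outputs above follow the standard format of a transformers text-classification pipeline: a list of dictionaries with `label` and `score` keys. As a minimal sketch (the helper name `is_travel_related` and the threshold are illustrative, not part of the model), the predictions can be turned into a boolean flag like this:

```python
# Hypothetical helper: interprets the pipeline's standard output format,
# [{'label': 'POSITIVE' | 'NEGATIVE', 'score': float}], as a travel flag.
def is_travel_related(result, threshold=0.5):
    """Return True if the top prediction is POSITIVE with sufficient confidence."""
    top = result[0]
    return top["label"] == "POSITIVE" and top["score"] >= threshold

# Example outputs mirroring the ones above
print(is_travel_related([{"label": "POSITIVE", "score": 0.9999655485153198}]))  # True
print(is_travel_related([{"label": "NEGATIVE", "score": 0.9999724626541138}]))  # False
```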
Limitations:
- Language: Optimized for French text; performance on other languages is not guaranteed.
- Performance: Specifically trained for binary classification. Performance may degrade on multi-class or unrelated tasks.
Labels
The model uses the following classification labels:
- POSITIVE: Travel-related sentences
- NEGATIVE: Non-travel-related sentences
Training Data
The model was fine-tuned using a proprietary French dataset: Crysy-rthomas/T-AIA-CLASSIFICATION-DATASET. This dataset contains thousands of labeled examples for travel and non-travel sentences.
Hyperparameters and Fine-Tuning:
- Learning Rate: 5e-5
- Batch Size: 16
- Epochs: 3
- Evaluation Strategy: Epoch-based
- Optimizer: AdamW
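The hyperparameters above can be expressed with transformers' TrainingArguments. This is a sketch only: the output directory is a placeholder, and AdamW is the Trainer's default optimizer, so it needs no explicit setting.

```python
from transformers import TrainingArguments

# Sketch of the fine-tuning configuration listed above.
training_args = TrainingArguments(
    output_dir="./results",        # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",   # epoch-based evaluation
)
```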
Tokenizer
The tokenizer is based on the pre-trained CamemBERT tokenizer, reused as-is for this classification task. It uses subword tokenization based on the BPE (Byte-Pair Encoding) approach, which splits words into smaller units.
Tokenizer special settings:
- Max Length: 128
- Padding: Right-padded to 128 tokens
- Truncation: Longest-first strategy, truncating tokens beyond 128.
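In transformers, these settings correspond to calling the tokenizer with `padding="max_length", truncation=True, max_length=128`. As a toy illustration of the resulting behavior on a list of token ids (this is not the real BPE tokenizer, and the pad id is an assumed parameter):

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Right-pad with pad_id up to max_length, or drop ids beyond it."""
    if len(token_ids) > max_length:
        return token_ids[:max_length]  # truncate tokens beyond max_length
    # right-pad to exactly max_length tokens
    return token_ids + [pad_id] * (max_length - len(token_ids))

short = pad_or_truncate([5, 9, 12], max_length=8)
long = pad_or_truncate(list(range(200)), max_length=8)
print(short)  # [5, 9, 12, 0, 0, 0, 0, 0]
print(long)   # [0, 1, 2, 3, 4, 5, 6, 7]
```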
How to Use
You can load this model with Hugging Face's transformers library and create a text-classification pipeline with the pipeline function as follows:
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_path = "InesPL84/T-AIA-DISTILCAMEMBERT-BASE-TEXT-CLASSIFICATION"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

sentence = "Je veux aller de Paris à Lyon"
result = classifier(sentence)
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9999...}]
Limitations and Bias
While the model performs well on the training and test datasets, there are some known limitations:
- Bias in Dataset: Performance may reflect the biases in the training data.
- Generalization: Results may be biased towards specific named entities frequently seen in the training data (such as city names).
License
This model is released under the Apache 2.0 License.