|
--- |
|
license: apache-2.0 |
|
--- |
|
# Fine-Tuning Pre-Trained Model for English and Albanian |
|
|
|
This project demonstrates the process of fine-tuning a pre-trained model for language tasks in both **English** and **Albanian**. We utilize transfer learning with a pre-trained model (e.g., BERT or multilingual BERT) to adapt it for specific tasks in these two languages, such as text classification, named entity recognition (NER), or sentiment analysis. |
|
|
|
## Requirements |
|
|
|
### Prerequisites |
|
- Python 3.7+ |
|
- TensorFlow or PyTorch |
|
- Hugging Face Transformers library |
|
- CUDA-enabled GPU (recommended for faster training) |
|
|
|
### Dependencies |
|
Install the following Python libraries using `pip`: |
|
|
|
```bash |
|
pip install torch transformers datasets |
|
pip install tensorflow # If using TensorFlow |
|
pip install tqdm |
|
pip install scikit-learn |
|
```

## Model Overview
|
We fine-tune a pre-trained multilingual model (e.g., multilingual BERT (mBERT) or XLM-RoBERTa) to perform NLP tasks in both English and Albanian. These models are pre-trained on many languages, including English and Albanian, and can then be fine-tuned on a custom dataset tailored to your task.
|
|
|
**Example pre-trained models:**

- `bert-base-multilingual-cased`
- `xlm-roberta-base`
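
Either checkpoint can be loaded through the `Auto*` classes, which select the right architecture from the checkpoint name. A minimal sketch (the `num_labels=2` value is just a placeholder for your task):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Swap the checkpoint name to compare mBERT against XLM-RoBERTa
checkpoint = 'xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```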
|
## Fine-Tuning Process
|
### 1. Load the Pre-Trained Model and Tokenizer

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained multilingual model and its tokenizer
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Adjust num_labels based on your task
```
|
### 2. Prepare the Dataset

You can fine-tune the model on your own dataset (in English and Albanian) using Hugging Face's `datasets` library. The data can be provided in CSV or JSON format.

Example:

```python
from datasets import load_dataset

# Load the dataset (replace with the path to your own data)
dataset = load_dataset('csv', data_files='path_to_your_data.csv')
```
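
If the CSV loads as a single `train` split, you can create the train/test split used in the later steps directly with `datasets`. A minimal sketch, assuming an 80/20 split:

```python
# Split the single 'train' split into train and test subsets
dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)
```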
|
### 3. Preprocess the Data

Use the tokenizer to preprocess the dataset, converting text into token IDs compatible with the pre-trained model.

```python
def preprocess_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# Apply preprocessing to every split in the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)
```
|
### 4. Fine-Tuning the Model

Train the model on your dataset using either PyTorch or TensorFlow. Here's an example using PyTorch:

```python
import torch
from torch.utils.data import DataLoader

# Use a GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# The model expects the label column to be called 'labels'
if 'label' in tokenized_datasets['train'].column_names:
    tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Expose the tokenized columns as PyTorch tensors
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# Set training parameters
train_dataset = tokenized_datasets['train']
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Set optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}, Loss: {loss.item()}")
```
|
### 5. Evaluate the Model

After training, evaluate the model's performance on the validation or test split.

```python
from sklearn.metrics import accuracy_score

# Build a dataloader for the evaluation split (here assumed to be named 'test')
eval_dataloader = DataLoader(tokenized_datasets['test'], batch_size=16)

model.eval()
# Example evaluation loop
predictions = []
labels = []
for batch in eval_dataloader:
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels.extend(batch['labels'].numpy())
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=-1)
        predictions.extend(preds.cpu().numpy())

accuracy = accuracy_score(labels, predictions)
print(f"Accuracy: {accuracy}")
```
|
## Languages Supported

- **English:** The model is fine-tuned on English text for the task at hand (e.g., text classification, sentiment analysis).
- **Albanian:** The same model can be used for Albanian text, leveraging the multilingual pre-trained weights. Performance may vary depending on the dataset, but mBERT and XLM-R are known to perform well for Albanian.
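
As a quick check of the multilingual behavior, you can run the fine-tuned model on sentences in both languages. A minimal inference sketch (the example sentences and the binary labels are hypothetical and depend on your task):

```python
import torch

def classify(text):
    # Tokenize a single sentence and run it through the fine-tuned model
    inputs = tokenizer(text, return_tensors='pt', truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item()

# One English and one Albanian example sentence
print(classify("This movie was fantastic."))  # English
print(classify("Ky film ishte fantastik."))   # Albanian
```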
|
## Results

Fine-tuning a multilingual model in this way yields strong performance on both English and Albanian tasks. Results on the validation/test set should demonstrate good generalization across the two languages.

**Example results:**

- Accuracy: 85% on the English dataset
- Accuracy: 80% on the Albanian dataset
|
## Conclusion

By fine-tuning a pre-trained multilingual model, we significantly reduce the time and computational resources required compared to training a model from scratch. This approach leverages transfer learning: the model has already learned general linguistic patterns from a wide variety of languages, allowing it to adapt to specific tasks in both English and Albanian.
|
|
|
## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
|
|
|
|
|
|