BERT-Text2Date / README.md
Parssky's picture
Update README.md
6447be0 verified
metadata
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  - BERT
  - DATE
  - Persian
  - Transformer
  - Pytorch
license: mit
language:
  - en
  - fa

This model has been pushed to the Hub using the PytorchModelHubMixin integration:

Model Card for BERT-Text2Date

Model Overview

Model Name: BERT-Text2Date Model Type: BERT (Encoder-only architecture)
Language: Persian

Description:
This model is designed to process and generate Persian dates in both formal (YYYY-MM-DD) and informal formats. It utilizes a dataset that includes various representations of dates, allowing for effective training in understanding and predicting Persian date formats.

Fullcode On github: https://github.com/parssky/BERT-Date2Text (Training - Dataset - Infrence)

Dataset

Dataset Description:
The dataset consists of two types of dates: formal and informal. It is generated using two main functions:

  • convert_year_to_persian(year): Converts years to Persian format, currently supporting the year 1400.
  • generate_date_mappings_with_persian_year(start_year, end_year): Generates dates for a specified range, considering the number of days in each month.

Data Formats:

  • Informal Dates: Various formats like “روز X ماه سال” and “اول/دوم/… ماه سال”.
  • Formal Dates: Stored in YYYY-MM-DD format.

Example Dates:

  • بیست و هشتم اسفند هزار و چهار صد و ده, 1410-12-28
  • 1 فروردین 1400, 1400-01-01

Data Split:

  • Training Set: 80% (19272 samples)
  • Validation Set: 10% (2409 samples)
  • Test Set: 10% (2409 samples)

Model Architecture

Architecture Details:
The model is built using an encoder-only architecture, consisting of:

  • Layers: 4 Encoder layers
  • Parameters:
    • vocab_size: 25003
    • context_length: 32
    • emb_dim: 256
    • n_heads: 4
    • drop_rate: 0.1

Parameter Count: 14,933,931

Transformer( (embedding): Embedding(25003, 256) (positional_encoding): Embedding(32, 256) (en): TransformerEncoder( (layers): ModuleList( (0-3): 4 x TransformerEncoderLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=False) ) (linear1): Linear(in_features=256, out_features=512, bias=False) (dropout): Dropout(p=0.1, inplace=False) (linear2): Linear(in_features=512, out_features=256, bias=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=0.1, inplace=False) (dropout2): Dropout(p=0.1, inplace=False) ) ) ) (fc_train): Linear(in_features=256, out_features=25003, bias=True) )

Tokenizer:
The model uses a Persian tokenizer named “بلبل زبان” available on Hugging Face, with a vocabulary size of 25,000 tokens.

Training

Training Process:

  • Batch Size: 2048
  • Epochs: 60
  • Learning Rate: 0.00005
  • Optimizer: AdamW
  • Weight Decay: 0.2
  • Masking Technique: The formal part of the date is masked to facilitate learning.

Performance Metrics:

  • Training Loss: Reduced from 10.3 to 0.005 over 60 epochs.
  • Validation Loss: Reduced from 10.1 to 0.010.
  • Test Accuracy: 79% (exact match required).
  • Perplexity: 1.01

Inference

Inference Code:
The model can be loaded along with the tokenizer using the provided Inference.ipynb file. Three functions are implemented:

  1. Convert Token IDs to Text
def text_to_token_ids(text, tokenizer):

    encoded = tokenizer.encode(text)
    
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension

return encoded_tensor
  1. Convert Text to Token IDs
def token_ids_to_text(token_ids, tokenizer):

    flat = token_ids.squeeze(0) # remove batch dimension

return tokenizer.decode(flat.tolist())
  1. predict_masked(input): Takes an input to predict the masked date.
def predict_masked(model,tokenizer,input,deivce):

    model.eval()
    
    inputs_masked = input + " " + "[MASK][MASK][MASK][MASK]-[MASK][MASK]-[MASK][MASK]"
    
    input_ids = tokenizer.encode(inputs_masked)
    
    input_ids = torch.tensor(input_ids).to(deivce)
    
    with torch.no_grad():
    
    logits = model(input_ids.unsqueeze(0))
    
    logits = logits.flatten(0, 1)
    
    probs = torch.argmax(logits,dim=-1,keepdim=True)
    
    token_ids = probs.squeeze(1)
    
    answer_ids = token_ids[-11:-1]

return token_ids_to_text(answer_ids,tokenizer)

And use:

predict_masked(model,tokenizer,"12 آبان 1402","cuda")

Output:

'1402-08-12'

Limitations

  • The model currently only supports Persian dates for the year 1400-1410, with potential for expansion.
  • Performance may vary with dates outside the training dataset.

Intended Use

This model is intended for applications requiring date recognition and generation in Persian, such as natural language processing tasks, chatbots, or educational tools.

Acknowledgements

  • Special thanks to the developers of the “بلبل زبان” tokenizer and the contributors to the dataset.