---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- BERT
- DATE
- Persian
- Transformer
- Pytorch
license: mit
language:
- en
- fa
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
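
A model class that inherits from `PyTorchModelHubMixin` can typically be loaded back from the Hub with `from_pretrained`. The snippet below is a minimal sketch of that pattern; the class body and the repository id are placeholders, not taken from this repository.

```python
from huggingface_hub import PyTorchModelHubMixin
import torch.nn as nn

# Sketch only: the real Transformer class lives in the GitHub repository linked below.
class Transformer(nn.Module, PyTorchModelHubMixin):
    ...

# "<user>/<repo>" is a placeholder for this model's Hub repository id.
model = Transformer.from_pretrained("<user>/<repo>")
```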


# Model Card for BERT-Text2Date

## Model Overview

**Model Name:** BERT-Text2Date
**Model Type:** BERT (Encoder-only architecture)  
**Language:** Persian

**Description:**  
This model is designed to process and generate Persian dates in both formal (YYYY-MM-DD) and informal formats. It utilizes a dataset that includes various representations of dates, allowing for effective training in understanding and predicting Persian date formats.

Full code on GitHub: https://github.com/parssky/BERT-Date2Text (training, dataset, inference)

## Dataset

**Dataset Description:**  
The dataset consists of two types of dates: formal and informal. It is generated using two main functions (a sketch of the generation process follows the example dates below):

- **`convert_year_to_persian(year)`**: Converts years to Persian format, currently supporting the year 1400.
- **`generate_date_mappings_with_persian_year(start_year, end_year)`**: Generates dates for a specified range, considering the number of days in each month.

**Data Formats:**

- **Informal Dates:** Various formats such as “روز X ماه سال” (“day X of month, year”) and “اول/دوم/… ماه سال” (“first/second/… of month, year”).
- **Formal Dates:** Stored in YYYY-MM-DD format.

**Example Dates:**

- بیست و هشتم اسفند هزار و چهار صد و ده, 1410-12-28
- 1 فروردین 1400, 1400-01-01
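
For illustration, here is a minimal sketch of how the date-mapping generator described above could be written. The function names come from the description, but the month lengths, the placeholder year-to-words conversion, and the single informal format per date are assumptions rather than the repository's actual code.

```python
# Solar Hijri month names and lengths (Esfand treated as 29 days; leap years ignored).
PERSIAN_MONTHS = ["فروردین", "اردیبهشت", "خرداد", "تیر", "مرداد", "شهریور",
                  "مهر", "آبان", "آذر", "دی", "بهمن", "اسفند"]
DAYS_IN_MONTH = [31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 29]

def convert_year_to_persian(year: int) -> str:
    """Spell a year out in Persian words, e.g. 1410 -> 'هزار و چهار صد و ده' (placeholder)."""
    return str(year)

def generate_date_mappings_with_persian_year(start_year: int, end_year: int):
    """Yield (informal, formal) pairs such as ('1 فروردین 1400', '1400-01-01')."""
    for year in range(start_year, end_year + 1):
        for month, (name, n_days) in enumerate(zip(PERSIAN_MONTHS, DAYS_IN_MONTH), start=1):
            for day in range(1, n_days + 1):
                informal = f"{day} {name} {year}"
                formal = f"{year:04d}-{month:02d}-{day:02d}"
                yield informal, formal
```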

**Data Split:**

- **Training Set:** 80% (19272 samples)
- **Validation Set:** 10% (2409 samples)
- **Test Set:** 10% (2409 samples)

## Model Architecture

**Architecture Details:**  
The model is built using an encoder-only architecture, consisting of:

- **Layers:** 4 Encoder layers
- **Parameters:**
    - `vocab_size`: 25003
    - `context_length`: 32
    - `emb_dim`: 256
    - `n_heads`: 4
    - `drop_rate`: 0.1

**Parameter Count:** 14,933,931

```
Transformer(
  (embedding): Embedding(25003, 256)
  (positional_encoding): Embedding(32, 256)
  (en): TransformerEncoder(
    (layers): ModuleList(
      (0-3): 4 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=False)
        )
        (linear1): Linear(in_features=256, out_features=512, bias=False)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=256, bias=False)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (fc_train): Linear(in_features=256, out_features=25003, bias=True)
)
```
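
For reference, a hedged sketch of how a module matching the printout above could be assembled from PyTorch building blocks. The attribute names and sizes mirror the printout; `batch_first=True` and the feed-forward width of 512 (read off `linear1`) are inferred, and the `bias=False` argument requires PyTorch 2.1 or newer.

```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    """Encoder-only model matching the printed structure (sketch, not the original code)."""
    def __init__(self, vocab_size=25003, context_length=32, emb_dim=256,
                 n_heads=4, n_layers=4, drop_rate=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.positional_encoding = nn.Embedding(context_length, emb_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=n_heads, dim_feedforward=512,
            dropout=drop_rate, bias=False, batch_first=True)
        self.en = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.fc_train = nn.Linear(emb_dim, vocab_size)  # bias=True, as in the printout

    def forward(self, input_ids):
        # input_ids: (batch, seq_len) with seq_len <= context_length
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.embedding(input_ids) + self.positional_encoding(positions)
        x = self.en(x)
        return self.fc_train(x)  # (batch, seq_len, vocab_size)
```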

**Tokenizer:**  
The model uses a Persian tokenizer named “بلبل زبان” (Bolbol Zaban), available on Hugging Face, with a vocabulary of 25,000 tokens.
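
To reproduce the tokenization, loading the tokenizer with `transformers` should look roughly like this; the repository id is a placeholder, not the tokenizer's actual Hub name, and the relation between the 25,000-token vocabulary and the model's `vocab_size` of 25,003 (a few extra special tokens) is an assumption.

```python
from transformers import AutoTokenizer

# Placeholder id -- replace with the actual "بلبل زبان" tokenizer repository on the Hub.
tokenizer = AutoTokenizer.from_pretrained("<persian-tokenizer-repo>")
print(tokenizer.vocab_size)  # ~25,000 base tokens
```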

## Training

**Training Process:**

- **Batch Size:** 2048
- **Epochs:** 60
- **Learning Rate:** 0.00005
- **Optimizer:** AdamW
- **Weight Decay:** 0.2
- **Masking Technique:** The formal part of the date is masked to facilitate learning (a training-step sketch follows this list).
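
The following is a minimal sketch of the kind of masked training step these settings imply. The batch layout (`masked_ids` / `target_ids`), the `[MASK]` token id, and restricting the loss to masked positions are assumptions, not the repository's actual training code.

```python
import torch
from torch.nn import functional as F

def train_step(model, optimizer, batch, mask_token_id, device):
    """One training step: the formal YYYY-MM-DD part is replaced by [MASK] tokens
    and the model is trained to recover it (sketch under the stated assumptions)."""
    input_ids = batch["masked_ids"].to(device)   # informal date + masked formal template
    labels = batch["target_ids"].to(device)      # informal date + true formal date

    logits = model(input_ids)                    # (batch, seq_len, vocab_size)

    # Cross-entropy only on the masked positions.
    masked = input_ids.eq(mask_token_id)
    loss = F.cross_entropy(logits[masked], labels[masked])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer settings from the list above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.2)
```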

**Performance Metrics:**

- **Training Loss:** Reduced from 10.3 to 0.005 over 60 epochs.
- **Validation Loss:** Reduced from 10.1 to 0.010.
- **Test Accuracy:** 79% (exact match required).
- **Perplexity:** 1.01

## Inference

**Inference Code:**  
The model and tokenizer can be loaded using the provided `Inference.ipynb` notebook. Three helper functions are implemented:

1. **Convert Text to Token IDs**
```python
import torch

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text)
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor
```

2. **Convert Token IDs to Text**
```python
def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())
```

3. **`predict_masked(model, tokenizer, input, device)`**: Appends a masked formal-date template to the input text and predicts the masked date.
```python
def predict_masked(model, tokenizer, input, device):
    model.eval()

    # Append the formal-date template so the model fills in the [MASK] positions.
    inputs_masked = input + " " + "[MASK][MASK][MASK][MASK]-[MASK][MASK]-[MASK][MASK]"
    input_ids = tokenizer.encode(inputs_masked)
    input_ids = torch.tensor(input_ids).to(device)

    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0))

    logits = logits.flatten(0, 1)
    probs = torch.argmax(logits, dim=-1, keepdim=True)  # greedy choice per position
    token_ids = probs.squeeze(1)
    answer_ids = token_ids[-11:-1]  # the 10 tokens covering the predicted YYYY-MM-DD

    return token_ids_to_text(answer_ids, tokenizer)
```

And use:
```python
predict_masked(model, tokenizer, "12 آبان 1402", "cuda")
```
Output: 
```
'1402-08-12'
```
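
The 79% exact-match test accuracy reported above could be computed with a loop like the one below; `test_pairs` (a list of `(informal, formal)` tuples) is an assumed variable, and the final comment simply restates that perplexity is the exponential of the mean cross-entropy loss.

```python
import math

def exact_match_accuracy(model, tokenizer, test_pairs, device="cuda"):
    """A prediction counts as correct only if the full YYYY-MM-DD string matches the target."""
    correct = sum(
        predict_masked(model, tokenizer, informal, device) == formal
        for informal, formal in test_pairs
    )
    return correct / len(test_pairs)

# Perplexity is exp(mean cross-entropy loss), e.g. math.exp(0.010) ≈ 1.01.
```
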
## Limitations

- The model currently supports only Persian (Solar Hijri) dates in the years 1400-1410, with potential for expansion.
- Performance may vary with dates outside the training dataset.

## Intended Use

This model is intended for applications requiring date recognition and generation in Persian, such as natural language processing tasks, chatbots, or educational tools.

## Acknowledgements

- Special thanks to the developers of the “بلبل زبان” tokenizer and the contributors to the dataset.