Updated metadata

888b7c5 verified 7 days ago

7.29 kB

	---
	base_model: t5-small
	library_name: transformers
	license: apache-2.0
	tags:
	- generated_from_trainer
	model-index:
	- name: en-af-sql-training-1727527893
	results: []
	datasets:
	- b-mc2/sql-create-context
	- Clinton/Text-to-sql-v1
	- knowrohit07/know_sql
	language:
	- af
	- en
	pipeline_tag: text2text-generation
	metrics:
	- Exact Match
	- TSED (Tree Similarity of Editing Distance)
	- SQAM (SQL Query Analysis Metric)
	- BLEU score
	---

	# en-af-sql-training-1727527893

	This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know-sql.
	It achieves the following results on the evaluation set:
	- Loss: 0.0210

	## Model description

	This is a fine-tuned Afrikaans-to-SQL model. The pretrained [t5-small](https://huggingface.co/t5-small) was used to train our SQL model.

	## Training and Evaluation Datasets

	As mentioned, to train the model we used a combination of three dataset which we split into training, testing, and validation sets. THe dataset can be found by following these links:

	- [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
	- [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)
	- [knowrohit07/know-sql](https://huggingface.co/datasets/knowrohit07/know_sql)

	We did a 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test,` and `validation` sets.
	```json
	DatasetDict({
	train: Dataset({
	features: ['answer', 'question', 'context', 'afr question'],
	num_rows: 118692
	})
	test: Dataset({
	features: ['answer', 'question', 'context', 'afr question'],
	num_rows: 14838
	})
	validation: Dataset({
	features: ['answer', 'question', 'context', 'afr question'],
	num_rows: 14838
	})
	})
	```

	The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes in the schema context such that it can generate more accurate queries for a given database.

	Input prompt
	```python
	Table context: CREATE TABLE table_55794 (
	"Home team" text,
	"Home team score" text,
	"Away team" text,
	"Away team score" text,
	"Venue" text,
	"Crowd" real,
	"Date" text
	)
	Question: Watter tuisspan het'n span mebbourne?
	Answer:
	```
	Expected Output
	```sql
	SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'
	```

	## Intended uses & limitations

	This model takes in a single prompt (similar to the one above) that is tokenized and it then uses the `input_ids` to generate an output SQL query. However the prompt must be structured in a specific way.

	The `prompt` must start with the table/schema description followed by the question followed by an empty answer. Below we illustrate an example on how to use it. Furthermore, our combined dataset looks as follows:

	Tokenized Dataset
	```json
	DatasetDict({
	train: Dataset({
	features: ['input_ids', 'labels'],
	num_rows: 118692
	})
	test: Dataset({
	features: ['input_ids', 'labels'],
	num_rows: 14838
	})
	validation: Dataset({
	features: ['input_ids', 'labels'],
	num_rows: 14838
	})
	})
	```
	Usage
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
	# Load the model and tokenizer from Hugging Face Hub
	repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
	en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
	en_af_sql_model = en_af_sql_model.to('cuda')
	tokenizer = AutoTokenizer.from_pretrained(repo_name)

	question = "Watter tuisspan het'n span mebbourne?"
	context = "CREATE TABLE table_55794 (
	"Home team" text,
	"Home team score" text,
	"Away team" text,
	"Away team score" text,
	"Venue" text,
	"Crowd" real,
	"Date" text
	)"

	prompt = f"""Tables:
	{context}

	Question:
	{question}

	Answer:
	"""
	inputs = tokenizer(prompt, return_tensors='pt')
	inputs = inputs.to('cuda')

	output = tokenizer.decode(
	en_af_sql_model.generate(
	inputs["input_ids"],
	max_new_tokens=200,
	)[0],
	skip_special_tokens=True
	)

	print("Predicted SQL Query:")
	print(output)
	```

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.005
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 2

	We used the following in our program:
	```python
	output_dir = f'./en-af-sql-training-{str(int(time.time()))}'

	training_args = TrainingArguments(
	output_dir=output_dir,
	learning_rate=5e-3,
	num_train_epochs=2,
	per_device_train_batch_size=16, # batch size per device during training
	per_device_eval_batch_size=16, # batch size for evaluation
	weight_decay=0.01,
	logging_steps=50,
	evaluation_strategy='steps', # evaluation strategy to adopt during training
	eval_steps=500, # number of steps between evaluation
	)

	trainer = Trainer(
	model=finetuned_model,
	args=training_args,
	train_dataset=tokenized_datasets['train'],
	eval_dataset=tokenized_datasets['validation'],
	)
	```

	### Training results

	\| Training Loss \| Epoch \| Step \| Validation Loss \|
	\|:-------------:\|:------:\|:----:\|:---------------:\|
	\| 0.0573 \| 0.1348 \| 500 \| 0.0452 \|
	\| 0.0424 \| 0.2695 \| 1000 \| 0.0364 \|
	\| 0.037 \| 0.4043 \| 1500 \| 0.0323 \|
	\| 0.0356 \| 0.5391 \| 2000 \| 0.0287 \|
	\| 0.0328 \| 0.6739 \| 2500 \| 0.0269 \|
	\| 0.0281 \| 0.8086 \| 3000 \| 0.0255 \|
	\| 0.0286 \| 0.9434 \| 3500 \| 0.0238 \|
	\| 0.0269 \| 1.0782 \| 4000 \| 0.0233 \|
	\| 0.0247 \| 1.2129 \| 4500 \| 0.0225 \|
	\| 0.0245 \| 1.3477 \| 5000 \| 0.0217 \|
	\| 0.0226 \| 1.4825 \| 5500 \| 0.0214 \|
	\| 0.0245 \| 1.6173 \| 6000 \| 0.0211 \|
	\| 0.024 \| 1.7520 \| 6500 \| 0.0210 \|
	\| 0.0249 \| 1.8868 \| 7000 \| 0.0210 \|

	### Testing results

	After our model was trained and validated, we evaluated the model using four evaluation metrics.

	- Exact Match Accuracy: This measured the accuracy of our model predicting the exact same SQL query as the target query.
	- TSED score: This metric ranges from 0 to 1 and was proposed by [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper. It allows us to estimate the execution performance of the output query, allowing us to estimate the model's execution accuracy.
	- SQAM accuracy: Similar to TSED, we can used this to estimate the output query's execution accuracy (also see [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper).
	- BLEU score: This helps us measure the similarity between the output query and the target query.

	The following were the obtained results over the testing set (14838 records):

	- Exact Match = 35.98 %
	- TSED score: 0.897
	- SQAM score: 74.31 %
	- BLEU score: 0.762

	### Framework versions

	- Transformers 4.44.2
	- Pytorch 2.4.0
	- Datasets 3.0.0
	- Tokenizers 0.19.1