---
base_model: ybelkada/falcon-7b-sharded-bf16
tags:
- generated_from_trainer
- lora
- falcon
model-index:
- name: results
  results: []
datasets:
- Clinton/Text-to-sql-v1
library_name: peft
language:
- en
pipeline_tag: text-generation
---


# AI2sql

AI2sql is a large language model (LLM) for converting natural language questions into SQL queries.

## Model description

AI2SQL is a specialized LLM fine-tuned from Falcon-7b-instruct using Parameter-Efficient Fine-Tuning (PEFT) with LoRA, tailored for interpreting natural language and generating corresponding SQL queries.

## Intended uses & limitations

AI2SQL is designed for data analysts, business intelligence professionals, and developers to facilitate the conversion of natural language questions into SQL queries. It aids those who are not proficient in SQL, enabling easier database querying.

AI2SQL's performance is inherently tied to the characteristics of its training data. While it has been trained on a diverse and substantial dataset, it may not account for all possible SQL dialects or database structures. Careful review of the generated SQL queries is recommended.

## Inference

### Model Deployment
AI2SQL is designed for efficient real-time inference, making it suitable for interactive applications where users query databases using natural language.

### Computational Requirements
- **Hardware Requirements**: AI2SQL runs on a single GPU; an NVIDIA A10 delivers satisfactory performance, and a higher-end GPU is recommended for more complex queries or lower latency.
- **Memory Footprint**: The model requires at least 14 GB of memory for inference in bfloat16 (roughly 2 bytes per parameter for a 7B model); quantized loading can reduce this, as in the sketch below.
- **Latency**: The average response time for generating a SQL query depends on the hardware used and the complexity of the query. On an NVIDIA A10, latency is satisfactory, with faster responses on more capable hardware.
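
A minimal sketch of memory-reducing 4-bit loading, assuming bitsandbytes and accelerate are installed; it loads the base model listed in this card's metadata (adapter loading is shown in the inference example below):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization roughly quarters the bfloat16 footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ybelkada/falcon-7b-sharded-bf16",
    quantization_config=bnb_config,
    device_map="auto",
)
```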

### Usage Guidelines
To use AI2SQL for generating SQL queries, follow these steps:

1. **Preparation**: Ensure that your system meets the hardware and software requirements for running the model.
2. **Input Formatting**: Format your natural language questions clearly and concisely for best results.
3. **Model Invocation**: Call the AI2SQL model with the natural language question as input. The model returns the corresponding SQL query as output.

### Example Code for Inference
Transformers has no built-in `text-to-sql` pipeline, so the adapter is loaded directly with PEFT (the repo ID below is a placeholder for wherever the adapter is hosted):

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter together with its Falcon base model.
model_id = "your-namespace/AI2sql"  # placeholder repo ID
model = AutoPeftModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example natural language question
question = "How many products were sold last month?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# Generate the SQL query, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=128)
sql_query = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Generated SQL Query:", sql_query)
```


### Scalability
Throughput scales with batching and the serving stack: requests can be batched at generation time (see the sketch below), and the model can be placed behind a dedicated inference server, making it suitable for deployment in high-demand environments.
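
A minimal batched-generation sketch, reusing the `model` and `tokenizer` from the inference example above; the questions are illustrative:

```python
questions = [
    "How many products were sold last month?",
    "List the top five customers by total order value.",
]

# Decoder-only models need left padding for batched generation; Falcon's
# tokenizer ships without a pad token, so reuse the EOS token.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(questions, padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**batch, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)

# Strip the (left-padded) prompts, keeping only the generated SQL.
generated = outputs[:, batch["input_ids"].shape[1]:]
for question, ids in zip(questions, generated):
    print(question, "->", tokenizer.decode(ids, skip_special_tokens=True))
```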

### Error Handling
The model itself does not validate its output, so invalid or ambiguous inputs can yield malformed SQL. Wrapping model calls in an application-level validation step lets you return meaningful error messages that guide users in correcting their queries; a lightweight syntax check is sketched below.
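
One possible check, assuming the sqlglot parser is available (the helper name is hypothetical):

```python
import sqlglot
from sqlglot.errors import ParseError

def validate_sql(sql: str) -> str:
    """Raise ValueError if the generated SQL does not parse."""
    try:
        sqlglot.parse_one(sql)  # raises ParseError on invalid syntax
    except ParseError as exc:
        raise ValueError(f"Model produced invalid SQL: {exc}") from exc
    return sql

validate_sql("SELECT COUNT(*) FROM sales WHERE month = '2023-11'")
```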

### Security Considerations
Users should be aware of security implications when using AI2SQL, especially when dealing with sensitive data or integrating the model into secure environments. Ensure all data handling complies with relevant privacy and security regulations, and avoid executing generated SQL with write privileges against production data.
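
As one mitigation, generated queries can be executed over a read-only connection; a minimal SQLite sketch (the database path and query are hypothetical):

```python
import sqlite3

sql_query = "SELECT COUNT(*) FROM sales"  # e.g., model output after validation

# mode=ro opens the database read-only, so the query cannot modify data.
conn = sqlite3.connect("file:analytics.db?mode=ro", uri=True)
try:
    rows = conn.execute(sql_query).fetchall()
    print(rows)
finally:
    conn.close()
```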



## Training and evaluation data

AI2SQL was trained on a comprehensive dataset comprising 262,000 rows of paired natural language questions and SQL queries sourced from the [Text-to-SQL Dataset](https://huggingface.co/datasets/Clinton/Text-to-sql-v1), covering a wide array of domains and question complexities.

## Training procedure

### Overview
AI2SQL was trained in a multi-stage process, starting with a pre-trained Falcon-7b-instruct model, a large transformer-based language model. This base model was then fine-tuned using a Parameter-Efficient Fine-Tuning (PEFT) approach with Low-Rank Adaptation (LoRA) specifically for the task of translating natural language to SQL queries.

### Data Preparation
The training dataset, sourced from the [Text-to-SQL Dataset](https://huggingface.co/datasets/Clinton/Text-to-sql-v1), contains 262,000 pairs, each consisting of a natural language question and its corresponding SQL query, covering a diverse range of domains and query complexities.
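
A minimal data-loading sketch; the prompt template and column names are assumptions, not the exact ones used in training:

```python
from datasets import load_dataset

dataset = load_dataset("Clinton/Text-to-sql-v1", split="train")

def to_prompt(example):
    # Assumed column names; check the dataset card for the real schema.
    return {"text": f"Question: {example['instruction']}\nSQL: {example['response']}"}

dataset = dataset.map(to_prompt)
print(dataset[0]["text"])
```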

### Fine-Tuning Process
1. **Data Preprocessing**: The dataset was preprocessed to normalize text and SQL queries, ensuring consistency in formatting and syntax.
2. **Model Adaptation**: The Falcon-7b-instruct model was adapted using PEFT with LoRA (see the configuration sketch after this list), a technique that allows for efficient and targeted updates to the model's weights without extensive retraining. This approach is particularly beneficial for adapting large-scale models to specific tasks with limited computational resources.
3. **Training Strategy**: The model was trained in a supervised learning setup, where it learned to map natural language inputs to their corresponding SQL queries. Special attention was given to the model's ability to understand the semantics of the natural language questions and accurately reflect them in SQL syntax.
4. **Validation and Testing**: Throughout the training process, the model was periodically evaluated on a held-out validation set to monitor its performance and prevent overfitting. The final model was tested on an independent test set to assess its generalization capabilities.
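
A configuration sketch for the adaptation step; the rank, alpha, and dropout values are illustrative assumptions, and `query_key_value` is the attention projection typically targeted in Falcon models:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("ybelkada/falcon-7b-sharded-bf16")

lora_config = LoraConfig(
    r=16,                                # adapter rank (assumed)
    lora_alpha=32,                       # scaling factor (assumed)
    lora_dropout=0.05,                   # adapter dropout (assumed)
    target_modules=["query_key_value"],  # Falcon attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```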

### Model Evaluation
The model's performance was evaluated based on its accuracy in generating correct SQL queries corresponding to the input natural language questions. Metrics such as precision, recall, and F1-score were used to quantify the model's effectiveness.
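
As a rough illustration of the accuracy metric, an exact-match comparison after light normalization might look like this (the helpers are hypothetical):

```python
def normalize(sql: str) -> str:
    # Collapse whitespace and case so formatting differences do not count.
    return " ".join(sql.lower().split())

def exact_match_accuracy(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["SELECT  1"], ["select 1"]))  # 1.0
```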

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.03
- training_steps: 500
- mixed_precision_training: Native AMP
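
These settings map onto transformers' `TrainingArguments` roughly as follows (`output_dir` and the exact optimizer wiring are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective train batch size of 16
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_steps=500,
    seed=42,
    fp16=True,                      # native AMP mixed precision
)
```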

### Training results

AI2SQL's performance was rigorously evaluated post-training. The key metrics used to assess the model were:

- **Accuracy**: The percentage of queries where the model-generated SQL matched the expected SQL.
- **Precision**: The proportion of correctly generated SQL queries out of all queries generated by the model.
- **Recall**: The ability of the model to generate all relevant SQL queries corresponding to the input natural language questions.
- **F1-Score**: The harmonic mean of precision and recall, providing a balance between the two.

**Results:**
- Accuracy: TBD
- Precision: TBD
- Recall: TBD
- F1-Score: TBD

### Insights and Observations
- **Handling Complex Queries**: AI2SQL demonstrated a high proficiency in handling complex queries involving multiple SQL clauses and parameters.
- **Contextual Understanding**: The model showed a notable capability in understanding context and generating SQL queries that accurately reflect nuanced natural language instructions.
- **Performance on Diverse Data**: AI2SQL maintained consistent performance across various domains present in the training dataset, indicating its robustness and general applicability.

### Limitations Observed
- **Handling Ambiguous Questions**: The model sometimes struggled with ambiguous natural language inputs where the intent was not clear.
- **Query Specificity**: In cases of highly specific queries, the model occasionally generated SQL that was syntactically correct but did not completely align with the nuanced requirements of the question.

### Future Improvements
Based on the training results and observed limitations, future improvements could include:
- Enhanced training on ambiguous natural language inputs to improve the model's interpretative capabilities.
- Further fine-tuning with a broader range of specific and complex SQL queries to enhance the model's accuracy in generating nuanced SQL statements.

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0