dataeaze
/

dataeaze-text2sql-codellama_7b_instruct-clinton_text_to_sql_v1

+---
+license: cc-by-nc-sa-4.0
+language:
+- en
+library_name: transformers
+tags:
+- text-to-sql
+- text2sql
+- nlp2sql
+- nlp-to-sql
+- SQL
+---
+# Model Card for text2sql
+<!-- Provide a quick summary of what the model is/does. -->
+LLM instruction finetuned for Text-to-SQL task.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [dataeaze systems pvt ltd](https://www.dataeaze.io/)
+- **Funded by :** [dataeaze systems pvt ltd](https://www.dataeaze.io/)
+- **Shared by :** [dataeaze systems pvt ltd](https://www.dataeaze.io/)
+- **Model type:** LlamaForCausalLM
+- **Language(s) (NLP):** English
+- **License:** [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) Model is made available under non-commercial use for research purposes only. For commercial usage please connect at [email protected]
+- **Finetuned from model :** [CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+Model can be used a tool to convert queries in expressed in natural language (English) to SQL statements
+### Downstream Use
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+The model could be used as the initial stage in a data analytics / business intelligence application pipeline.
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+Model has been fine tuned on a specific task of converting English language statements to SQL queries.
+Any use beyond this is not guaranteed to be accurate.
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+- **Bias:** Trained for English language only.
+- **Risk:** Guardrails are reliant on the base models CodeLlama (Llama2). Finetuning could impact this behaviour.
+- **Limitations:** Intended to be a small model optimised for inference. Does not provide SoTA results on accuracy.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained(
+                                "dataeaze/dataeaze-text2sql-codellama_7b_instruct-clinton_text_to_sql_v1",
+                                torch_dtype=torch.bfloat16,
+                                device_map='auto'
+                                )
+tokenizer = AutoTokenizer.from_pretrained("dataeaze/dataeaze-text2sql-codellama_7b_instruct-clinton_text_to_sql_v1")
+# print("model device :", model.device)
+tokenizer.pad_token = tokenizer.eos_token
+model.eval()
+prompt = """ Below are sql tables schemas paired with instruction that describes a task.
+Using valid SQLite, write a response that appropriately completes the request for the provided tables.
+### Instruction: How many transactions were made by a customer in a specific month?
+### Database: RewardsProgramDB61
+### Input:
+CREATE SCHEMA RewardsProgram;
+CREATE TABLE Customer (
+    CustomerID INT NOT NULL AUTO_INCREMENT,
+    FirstName VARCHAR(50) NOT NULL,
+    LastName VARCHAR(50) NOT NULL,
+    Email VARCHAR(100) UNIQUE NOT NULL,
+    Phone VARCHAR(20) UNIQUE,
+    DateOfBirth DATE,
+    PRIMARY KEY (CustomerID)
+);
+CREATE TABLE Membership (
+    MembershipID INT NOT NULL AUTO_INCREMENT,
+    MembershipType VARCHAR(50) NOT NULL,
+    DiscountPercentage DECIMAL(5, 2) NOT NULL,
+    ValidFrom DATETIME,
+    ValidTo DATETIME,
+    CustomerID INT NOT NULL,
+    PRIMARY KEY (MembershipID),
+    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
+);
+CREATE TABLE Transaction (
+    TransactionID INT NOT NULL AUTO_INCREMENT,
+    TransactionDate TIMESTAMP,
+    TotalAmount DECIMAL(10, 2) NOT NULL,
+    CustomerID INT NOT NULL,
+    PRIMARY KEY (TransactionID),
+    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
+);
+CREATE TABLE TransactionDetail (
+    TransactionDetailID INT NOT NULL AUTO_INCREMENT,
+    TransactionID INT NOT NULL,
+    ProductID INT NOT NULL,
+    Quantity INT NOT NULL,
+    UnitPrice DECIMAL(10, 2) NOT NULL,
+    PRIMARY KEY (TransactionDetailID),
+    FOREIGN KEY (TransactionID) REFERENCES Transaction(TransactionID),
+    FOREIGN KEY (ProductID) REFERENCES Product(ProductID)
+);
+CREATE TABLE Product (
+    ProductID INT NOT NULL AUTO_INCREMENT,
+    ProductName VARCHAR(100) NOT NULL,
+    UnitPrice DECIMAL(10, 2) NOT NULL,
+    AvailableQuantity INT NOT NULL,
+    CreatedDate DATETIME,
+    PRIMARY KEY (ProductID)
+);
+ALTER TABLE Membership ADD CONSTRAINT FK_Membership_Customer FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID);
+ALTER TABLE TransactionDetail ADD CONSTRAINT FK_TransactionDetail_Transaction FOREIGN KEY (TransactionID) REFERENCES Transaction(TransactionID);
+ALTER TABLE TransactionDetail ADD CONSTRAINT FK_TransactionDetail_Product FOREIGN KEY (ProductID) REFERENCES Product(ProductID);"
+"""
+input_ids = tokenizer(prompt, padding=True, return_tensors='pt')
+outputs = model.generate(
+    input_ids=input_ids['input_ids'].to(model.device),
+    attention_mask=input_ids['attention_mask'].to(model.device),
+    max_new_tokens=3072,
+)
+generated_query = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(generated_query)
+```
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[SPIDER dataset Test Set](https://yale-lily.github.io/spider)
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+SQL queries are matched against the correct answer, with two types of evaluation
+* Execution with Values
+* Exact Set Match without Values
+### Results
+```
+model-index:
+  - name: dataeaze/dataeaze-text2sql-codellama_7b_instruct-dzsql
+    results:
+    - task:
+        type: text-to-sql
+      dataset:
+        name: SPIDER 1.0
+        type: text-to-sql
+      metrics:
+        - name: Execution with Values
+          type: Execution with Values
+          value: 64.3
+        - name: Exact Set Match without Values
+          type: Exact Set Match without Values
+          value: 29.6
+      source:
+        name: Spider 1.0 - Leaderboard
+        url: https://yale-lily.github.io/spider
+```
+## Model Card Authors
+* Suyash Chougule
+* Chittaranjan Rathod
+* Sourabh Daptardar
+## Model Card Contact
+"dataeaze systems" <[email protected]>