Spaces:
Running
Running
File size: 2,578 Bytes
b247dc4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 |
# DuckDB-NSQL
Numbers Station Text to SQL model for DuckDB.
NSQL is a family of autoregressive open-source foundational models (FMs) that are particularly designed for SQL generation tasks. We are thrilled to introduce DuckDB-NSQL in this repository, an FM tailored for local DuckDB SQL analytics tasks. All model weights can be found on HuggingFace.
| Model Name | Size | Link |
| --------------------------------------| ---- | -------------------------------------------------------------- |
| motherduckdb/DuckDB-NSQL-7B-v0.1 | 7B | [link](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1) |
| motherduckdb/DuckDB-NSQL-7B-v0.1-GGUF | 7B | [link](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1-GGUF)|
## Setup
To install all the necessary dependencies, please run
```
pip install -r requirements.txt
```
## Usage
Please refer to the examples in the `examples/` folder to learn how to connect to a local DuckDB and directly query your data. A simple notebook is provided in the `examples/` directory for reference.
To host the model with llama.cpp, please execute the following:
```python
# Import necessary modules
from llama_cpp import Llama
from wurlitzer import pipes
# Set up client with model path and context size
with pipes() as (out, err):
client = Llama(
model_path="DuckDB-NSQL-7B-v0.1-q8_0.gguf",
n_ctx=2048,
)
```
To load the DuckDB database and query against it, please execute the following:
```python
# Import necessary modules
import duckdb
from utils import generate_sql
# Connect to DuckDB database
con = duckdb.connect("nyc.duckdb")
# Sample question for SQL generation
question = "alter taxi table and add struct column with name test and keys a:int, b:double"
# Generate SQL, check validity, and print
sql = generate_sql(question, con, client)
print(sql)
```
## Training Data
The training data for this model consists of two parts: 1) 200k synthetically generated DuckDB SQL queries, based on the DuckDB v.0.9.2 documentation, and 2) labeled text-to-SQL pairs from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot).
## Evaluate the benchmark
Please refer to the `eval/` folder to check the details for evaluating the model against our proposed DuckDB benchmark.
## Acknowledgement
We would like to express our appreciation to all authors of the evaluation scripts. Their work made this project possible.
|