Commit 8cfbd9e by Simon Sorg (1 parent: dc6b696)
feat: add readme

README.md CHANGED

# Metric Card for Valid Efficiency Score

## Metric Description
This metric measures the efficiency of the SQL queries generated by a model. It is computed by executing the predicted and reference SQL queries and comparing the results to the expected results. In its simplest form it is the ratio of the number of correct results to the number of SQL queries generated; because the execution time of each correct prediction is also compared with that of its reference query, the score reflects how fast the predicted queries run as well as whether they are correct.
It is used for the BIRD benchmark.
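
For reference, the BIRD paper listed under Citation defines the valid efficiency score as shown here. This is a transcription of the paper's formula, given on the assumption that this module follows it; the card itself does not state the formula. $N$ is the number of examples, $V_n$ and $\hat{V}_n$ are the result sets of the reference and predicted queries $Y_n$ and $\hat{Y}_n$, $\mathbb{1}(\cdot,\cdot)$ is 1 when the two result sets match and 0 otherwise, and $E(\cdot)$ is the execution time of a query.

$$
\mathrm{VES} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}(V_n, \hat{V}_n)\cdot R(Y_n, \hat{Y}_n),
\qquad
R(Y_n, \hat{Y}_n) = \sqrt{\frac{E(Y_n)}{E(\hat{Y}_n)}}
$$

Under this definition a correct prediction that runs exactly as fast as its reference contributes 1, and an incorrect prediction contributes 0.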

## How to Use
```
from evaluate import load

module = load("Luckiestone/valid_efficiency_score")

# sql_queries_pred and sql_queries_ref hold the SQL strings to compare;
# execute is a callable that runs one query and returns its results.
results = module.compute(predictions=sql_queries_pred, references=sql_queries_ref, execute=execute)
print(results)
# e.g. {"ves": 1.0}
```

### Example
```
from evaluate import load
import sqlite3

module = load("Luckiestone/valid_efficiency_score")

# Create connection to the database
database_path = "database.sqlite"
connection = sqlite3.connect(database_path)
# Cursor
cursor = connection.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS Player
    (PlayerID INTEGER PRIMARY KEY,
    PlayerName TEXT NOT NULL);''')

# Insert two rows of data
cursor.execute("INSERT INTO Player VALUES (1, 'Cristiano Ronaldo')")
cursor.execute("INSERT INTO Player VALUES (2, 'Lionel Messi')")

def execute(sql_query):
    # Execute the SQL query and return all result rows
    cursor.execute(sql_query)
    result = cursor.fetchall()
    return result

sql_queries_pred = [
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Cristiano Ronaldo'",
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Lionel Messi'"
]

sql_queries_ref = [
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Cristiano Ronaldo'",
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Lionel Messi'"
]

# Compute the score
results = module.compute(predictions=sql_queries_pred, references=sql_queries_ref, execute=execute)
print(results)
```

### Inputs
- **predictions** *(list of strings): SQL queries generated by the model.*
- **references** *(list of strings): Reference SQL queries from the test set.*
- **execute** *(callable): Function that executes a SQL query and returns its results.*
- **filter_func** *(callable, optional): Function that filters the results of the SQL queries.*
- **num_executions** *(int, optional): Number of times to execute each SQL query.*

### Output Values
- **ves** *(float): Valid efficiency score.* Higher scores are better. The score nominally ranges from 0 to 1. However, if all predictions are correct and, because of timing jitter, the predictions execute faster than the references, the score can exceed 1.
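
To illustrate how a score above 1 can arise, here is a small sketch that computes a score from already measured results and timings. It assumes the definition from the BIRD paper (correctness multiplied by the square root of reference time over prediction time, averaged over all examples). The function `valid_efficiency_score` is a hypothetical helper written for this illustration, not this module's implementation, and the timings are made-up values in milliseconds.

```python
import math


def valid_efficiency_score(pred_results, ref_results, pred_times, ref_times):
    """Hypothetical helper: mean of correctness * sqrt(ref_time / pred_time)."""
    total = 0.0
    for pred, ref, t_pred, t_ref in zip(pred_results, ref_results, pred_times, ref_times):
        if set(pred) == set(ref):  # result sets match, so the prediction is valid
            total += math.sqrt(t_ref / t_pred)
        # an invalid prediction contributes 0
    return total / len(pred_results)


results = [[(1,)], [(1,)]]

# Same results and same timings: the score is exactly 1.
same = valid_efficiency_score(results, results, [2.0, 4.0], [2.0, 4.0])
# Same results, but the predictions were measured as 4x faster: the score is 2.
faster = valid_efficiency_score(results, results, [1.0, 1.0], [4.0, 4.0])
# One wrong prediction out of two halves the score.
half = valid_efficiency_score([[(1,)], [(0,)]], results, [2.0, 4.0], [2.0, 4.0])

print(same, faster, half)  # 1.0 2.0 0.5
```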

#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

## Limitations and Bias
The metric is limited to SQL queries. It is also slow to compute, because every predicted and reference query must be executed.
Furthermore, the results are non-deterministic: execution times vary between runs, even when they are averaged over multiple executions.
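
The run-to-run variation can be observed directly by timing the same query several times. The sketch below does this with the standard library against an in-memory SQLite database. The helper `time_query` is hypothetical, and although its `num_executions` parameter mirrors the name of this metric's input, how the module times queries internally is an assumption not confirmed by this card.

```python
import sqlite3
import statistics
import time


def time_query(cursor, sql_query, num_executions=5):
    """Hypothetical helper: wall-clock time of each of several executions."""
    timings = []
    for _ in range(num_executions):
        start = time.perf_counter()
        cursor.execute(sql_query).fetchall()
        timings.append(time.perf_counter() - start)
    return timings


connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Player (PlayerID INTEGER PRIMARY KEY, PlayerName TEXT NOT NULL)")
cursor.executemany(
    "INSERT INTO Player VALUES (?, ?)",
    [(i, f"Player {i}") for i in range(1000)],
)

timings = time_query(cursor, "SELECT COUNT(*) FROM Player WHERE PlayerName LIKE '%7%'")
# The individual timings differ between executions; averaging reduces the spread
# but does not remove it.
print(f"mean={statistics.mean(timings):.6f}s spread={max(timings) - min(timings):.6f}s")
```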

## Citation
@article{li2023can,
  title={Can {LLM} Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-{SQL}s},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Li, Binhua and Yang, Jiaxi and Li, Bowen and Wang, Bailin and Qin, Bowen and Cao, Rongyu and Geng, Ruiying and others},
  journal={arXiv preprint arXiv:2305.03111},
  year={2023}
}