metadata

title: Valid Efficiency Score
tags:
  - evaluate
  - metric
description: 'TODO: add a description here'
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false

Metric Card for Valid Efficiency Score

Metric Description

This metric measures the efficiency of the SQL queries generated by a model. It is defined as the ratio of the number of correct results to the number of SQL queries generated. The metric is computed by executing the SQL queries and comparing the results to the expected results. It is used for the BIRD benchmark.

How to Use

from evaluate import load

module = load("Luckiestone/valid_efficiency_score")

results = module.compute(predictions=sql_queries_pred, references=sql_queries_ref, execute=execute)
print(results)
>>> {"ves": 1.0}

Example

from evaluate import load
import sqlite3

module = load("Luckiestone/valid_efficiency_score")

# Create connection to the database
database_path = "database.sqlite"
connection = sqlite3.connect(database_path)
# Cursor
cursor = connection.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS Player
             (PlayerID INTEGER PRIMARY KEY,
             PlayerName TEXT NOT NULL);''')

# Insert a row of data
cursor.execute("INSERT INTO Player VALUES (1, 'Cristiano Ronaldo')")
cursor.execute("INSERT INTO Player VALUES (2, 'Lionel Messi')")

def execute(sql_query):
    # Execute the SQL query
    cursor.execute(sql_query)
    result = cursor.fetchall()
    return result
    
sql_queries_pred = [
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Cristiano Ronaldo'",
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Lionel Messi'"
]

sql_queries_ref = [
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Cristiano Ronaldo'",
    "SELECT COUNT(*) FROM Player WHERE PlayerName = 'Lionel Messi'"
]

# Compute the score
results = module.compute(predictions=sql_queries_pred, references=sql_queries_ref, execute=execute,)
print(results)

Inputs

predictions (string): SQL queries generated.
references (string): SQL queries from the test set.
execute (callable): Function that executes the SQL queries and returns the results.
filter_func (callable, optional): Function that filters the results of the SQL queries.
num_executions (int, optional): Number of times to execute each SQL query.

Output Values

ves (float): Valid efficiency score. Higher scores are better. Technically ranges from 0 to 1, but if the predictions are exactly accurate and, due to some jittering, the time to execute the predictions is smaller than the time to execute the references, the score can be greater than 1.

Values from Popular Papers

Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.

Examples

Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.

Limitations and Bias

The metric is limited to SQL queries. It is also quite slow to compute, as it requires executing the SQL queries. Furthermore, the results are non-deterministic, as the time to execute the SQL queries can vary, even though we average over multiple executions.

Citation

@article{li2023can, title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls}, author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Li, Binhua and Yang, Jiaxi and Li, Bowen and Wang, Bailin and Qin, Bowen and Cao, Rongyu and Geng, Ruiying and others}, journal={arXiv preprint arXiv:2305.03111}, year={2023} }