title: DGEB
app_file: leaderboard/app.py
sdk: docker
sdk_version: 4.36.1
Diverse Genomic Embedding Benchmark
Installation | Usage | Leaderboard | Citing
DGEB is a benchmark for evaluating biological sequence models on functional and evolutionary information.
DGEB is designed to evaluate model embeddings using:
- Diverse sequences across the tree of life.
- Diverse tasks that capture different aspects of biological function.
- Both amino acid and nucleotide sequences.
The current version of DGEB consists of 18 datasets covering all three domains of life (Bacteria, Archaea and Eukarya). DGEB evaluates embeddings using six different embedding tasks: Classification, BiGene mining, Evolutionary Distance Similarity (EDS), Pair Classification, Clustering, and Retrieval.
We welcome contributions of new tasks and datasets.
Installation
Install DGEB using pip.
pip install dgeb
Usage
- Launch an evaluation from the command line (see cli.py):
dgeb --model facebook/esm2_t6_8M_UR50D
- To see all supported models and tasks:
dgeb --help
- Using the Python API:
import dgeb
model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
tasks = dgeb.get_tasks_by_modality(dgeb.Modality.PROTEIN)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model, output_folder="results")
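To evaluate several models in one go, the command-line call shown above can be scripted. The following is a minimal sketch that relies only on the `--model` flag from the usage example; the helper names `build_dgeb_command` and `run_models` are hypothetical and not part of DGEB.

```python
import subprocess
from typing import List


def build_dgeb_command(model_name: str) -> List[str]:
    """Build the argument list for one `dgeb --model <name>` invocation."""
    return ["dgeb", "--model", model_name]


def run_models(model_names: List[str]) -> None:
    """Run the DGEB CLI once per model, stopping at the first failure."""
    for name in model_names:
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_dgeb_command(name), check=True)
```

Calling `run_models(["facebook/esm2_t6_8M_UR50D"])` would then run the same evaluation as the single command above; the list can be extended with any model name reported by `dgeb --help`.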
Using a custom model
Custom models should subclass the dgeb.models.BioSeqTransformer abstract class and specify the modality, number of layers, and embedding dimension. See models.py for additional examples of custom model loading and inference.
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks.tasks import Modality

class MyModel(BioSeqTransformer):

    @property
    def modality(self) -> Modality:
        return Modality.PROTEIN

    @property
    def num_layers(self) -> int:
        return self.config.num_hidden_layers

    @property
    def embed_dim(self) -> int:
        return self.config.hidden_size

model = MyModel(model_name='path_to/huggingface_model')
tasks = dgeb.get_tasks_by_modality(model.modality)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model)
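The reason all three properties need to be defined can be illustrated with plain Python. The sketch below uses a stand-in base class, `SequenceModelBase`, which is hypothetical and not DGEB's `BioSeqTransformer`; it assumes the required properties are declared abstract, in which case a subclass that omits one cannot be instantiated. The values 6 and 320 are arbitrary example numbers.

```python
from abc import ABC, abstractmethod


class SequenceModelBase(ABC):
    """Stand-in base class for illustration only; not DGEB's implementation."""

    @property
    @abstractmethod
    def num_layers(self) -> int:
        """Number of layers in the model."""

    @property
    @abstractmethod
    def embed_dim(self) -> int:
        """Dimension of the output embeddings."""


class IncompleteModel(SequenceModelBase):
    # embed_dim is not defined, so this class is still abstract.
    @property
    def num_layers(self) -> int:
        return 6


class CompleteModel(SequenceModelBase):
    @property
    def num_layers(self) -> int:
        return 6

    @property
    def embed_dim(self) -> int:
        return 320


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if cls is still abstract."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Here `can_instantiate(CompleteModel)` succeeds while `can_instantiate(IncompleteModel)` does not, which is the kind of error to expect if a custom model forgets one of the required properties.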
Evaluating on a custom dataset
We strongly encourage users to contribute their custom datasets to DGEB. Please open a PR adding your dataset so that the community can benefit!
To evaluate on a custom dataset, first upload your dataset to the Hugging Face Hub. Then define a Task subclass with TaskMetadata that points to your Hugging Face dataset. For example, a classification task on a custom dataset can be defined as follows:
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks import Dataset, Task, TaskMetadata, TaskResult
from dgeb.tasks.classification_tasks import run_classification_task
from dgeb.tasks.tasks import Modality  # needed for the modality field below

class MyCustomTask(Task):
    metadata = TaskMetadata(
        id="my_custom_classification",
        display_name="...",
        description="...",
        type="classification",
        modality=Modality.PROTEIN,
        datasets=[
            Dataset(
                path="path_to/huggingface_dataset",
                revision="...",
            )
        ],
        primary_metric_id="f1",
    )

    def run(self, model: BioSeqTransformer) -> TaskResult:
        return run_classification_task(model, self.metadata)

model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
evaluation = dgeb.DGEB(tasks=[MyCustomTask])
evaluation.run(model)
Leaderboard
To add your submission to the DGEB leaderboard, follow these steps:
- Fork the DGEB repository by following GitHub's instructions on the Forking Workflow.
- Add your submission .json file to the leaderboard/submissions/<HF_MODEL_NAME>/ directory:
mv /path/to/<SUBMISSION_FILE>.json /path/to/DGEB/leaderboard/submissions/<HF_MODEL_NAME>/
- Update your fork with the new submission:
git add leaderboard/submissions/<HF_MODEL_NAME>/<SUBMISSION_FILE>.json
git commit -m "Add submission for <HF_MODEL_NAME>"
git push
- Open a pull request to the main branch of the repository via the GitHub interface.
Once the PR is reviewed and merged, your submission will be added to the leaderboard!
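The file-placement step can also be scripted. The following is a minimal sketch, not part of DGEB: the function name `stage_submission` and its parameters are hypothetical. Unlike the `mv` command, it copies the file and creates the model directory if it does not exist yet, and it checks that the file parses as JSON before staging it.

```python
import json
import shutil
from pathlib import Path


def stage_submission(submission_file: str, repo_root: str, hf_model_name: str) -> Path:
    """Copy a submission .json into leaderboard/submissions/<HF_MODEL_NAME>/."""
    source = Path(submission_file)
    if source.suffix != ".json":
        raise ValueError(f"Expected a .json file, got: {source.name}")

    # Fail early if the file is not valid JSON (raises json.JSONDecodeError).
    with source.open() as handle:
        json.load(handle)

    target_dir = Path(repo_root) / "leaderboard" / "submissions" / hf_model_name
    # Create the per-model directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 returns the destination path of the copied file.
    return Path(shutil.copy2(source, target_dir / source.name))
```

After staging, the `git add`, `git commit`, and `git push` commands from the steps above apply unchanged.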
Acknowledgements
DGEB follows the design of the text embedding benchmark MTEB, developed by Hugging Face 🤗. The evaluation code is adapted from the MTEB codebase.
Citing
DGEB was introduced in "Diverse Genomic Embedding Benchmark for Functional Evaluation Across the Tree of Life". Feel free to cite it:
TODO