---
title: DNA Identifier Tool
emoji: 📉
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 4.32.1
app_file: app.py
pinned: false
---
# Welcome to Lofi Amazon Rainforest Beats to Hack/AI to's DNA Identifier Tool.
This tool is intended to help conservationists and biologists identify unmatched eDNA samples, or verify known samples, by predicting genus from DNA sequences. When a prediction is uncertain, the tool can also visualize the DNA embedding space to help one hypothesize about which species the sequence could belong to.
Conserving and monitoring biodiversity is crucial but challenging, especially in remote and densely vegetated areas. Current methods such as camera traps and bioacoustic monitoring require processing huge stores of video and audio. Satellite imagery analysis is also difficult in areas with constant cloud cover or dense canopy, which may obscure the true conditions on the ground. DNA, by contrast, is found in various states of decay in water, soil, and sediment, and can last from a few hours in temperate waters to millennia in cold, dry permafrost. These so-called environmental DNA (eDNA) samples allow DNA to be extracted directly from the environment without the organism itself being present, offering a much less labor-intensive way to monitor biodiversity, provided the DNA sequences can be identified.
The current method for identifying eDNA sequences involves searching for a match within 2-3% difference in an incomplete reference library of DNA sampled directly from specimens (BOLD). However, many species are elusive or live in inaccessible regions, making direct sampling infeasible. We attempt to overcome the limitations of traditional species identification by combining eDNA with ecological layer data. We hypothesize that, beyond pure DNA similarity, knowledge about the area in which a sequence was found can give clues as to what the sequence could be. We introduce the largest DNA barcode model, trained on a global dataset of over five million sequences gathered from the Barcode of Life Data System ([Ratnasingham and Hebert, 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890991/)) and on a comprehensive dataset from the Amazon rainforest that includes DNA sequences and ecological layer data describing the coordinates where each sequence was sampled. The layers we use describe Annual Mean Air Temperature, Temperature Seasonality, Annual Precipitation, Precipitation Seasonality, Human Footprint, Elevation, and Population Density.
Our findings show that our DNA model clusters species effectively in the embedding space with modest training. Additionally, incorporating ecological layer data improves accuracy in genus classification tasks. This integration of eDNA and ecological layers offers a scalable method for advanced biodiversity analysis and conservation.
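
For reference, the conventional threshold-based lookup mentioned above (a match within 2-3% difference) can be sketched roughly as follows. This is a simplified illustration, not the BOLD identification engine: it assumes the sequences are already aligned to the same length, and the function names and the tiny in-memory reference dictionary are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of positions that differ between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return mismatches / len(seq_a)


def best_match(query: str, reference: dict, max_divergence: float = 0.03):
    """Return (label, distance) for the closest reference within the threshold, else None.

    `reference` maps a label to an aligned sequence (a hypothetical stand-in
    for a real reference library). The 3% default mirrors the upper end of the
    2-3% range quoted in the text.
    """
    best = None
    for label, ref_seq in reference.items():
        distance = p_distance(query, ref_seq)
        if distance <= max_divergence and (best is None or distance < best[1]):
            best = (label, distance)
    return best
```

A query with no reference sequence inside the threshold returns `None`, which is exactly the "unmatched eDNA" case this tool targets.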
## Genus Prediction
Enter a DNA sequence and the coordinates where you sampled it. (We can easily extend this tool to handle multiple DNA sequences with CSV upload.)
Our tool outputs the three most probable genera for your sample, based on its DNA and the environmental factors at the sample location. You can also see the three most probable genera based on DNA similarity alone.
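
As a rough illustration of the "top three" step, the sketch below turns one raw score per genus into probabilities with a softmax and keeps the three highest. It assumes the classifier emits one score per genus; how `app.py` actually computes its probabilities is not described here, and the function name is hypothetical.

```python
import math


def top_k_genera(scores: dict, k: int = 3) -> list:
    """Softmax the raw per-genus scores and return the k most probable (genus, probability) pairs."""
    if not scores:
        return []
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {genus: math.exp(score - max_score) for genus, score in scores.items()}
    total = sum(exps.values())
    probabilities = {genus: value / total for genus, value in exps.items()}
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Running the same function on scores from a DNA-only model and from a DNA-plus-ecological-layers model would give the two ranked lists described above.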
## DNA Embedding Space Visualization
Perhaps the highest genus probability for a DNA sequence is very low. This could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples. In that case, we can still examine the DNA embedding of the sequence in relation to known samples. The left t-SNE plot shows the embedding space of the top N most common species in the area surrounding the given coordinates, with clear group distinctions between species. The right t-SNE plot shows where the sample sequence embedding sits in that space and identifies the nearest species clusters.
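
One simple way to identify the nearest species clusters is to compare the sample embedding with each cluster's centroid. The sketch below is an assumption about how this could be done, not the tool's actual implementation; the function names are hypothetical. Distances are computed in the original embedding space rather than in the 2-D t-SNE projection, since t-SNE does not preserve global distances.

```python
import math


def centroid(vectors: list) -> list:
    """Component-wise mean of a non-empty list of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def nearest_clusters(sample: list, clusters: dict, k: int = 3) -> list:
    """Rank species by Euclidean distance from the sample embedding to each cluster centroid.

    `clusters` maps a species label to the embeddings of its known samples.
    Returns up to k (species, distance) pairs, closest first.
    """
    ranked = [
        (species, math.dist(sample, centroid(vectors)))
        for species, vectors in clusters.items()
    ]
    ranked.sort(key=lambda item: item[1])
    return ranked[:k]
```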
# Future Work and Downstream Tasks
We describe interesting avenues for future work that were not in scope due to time constraints.
Future Tasks for the DNA Identifier Tool:
- Include a CSV upload task to process many DNA sequences at once
- Include more ecological layers, such as layers pertaining to soil properties
- BOLD has other very interesting columns, such as depth, elevation, and habitat (text description), but they are extremely sparse (>90% missing). We could add these as optional features to our tool and update our model to handle these inputs
- Add a retrieval-augmented generation (RAG) LLM tool that scrapes traditional/ecological knowledge about species to help predict genus
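
As a starting point for the CSV upload task listed above, the sketch below parses a batch of samples and separates valid rows from rejected ones. The column names (`sequence`, `latitude`, `longitude`), the accepted alphabet, and the function name are all assumptions made for illustration; the existing tool does not define a CSV format.

```python
import csv
import io

VALID_BASES = set("ACGTN")  # assumed alphabet; N stands for an unknown base


def read_samples(csv_text: str):
    """Parse CSV text into (sequence, latitude, longitude) tuples.

    Returns (samples, errors), where errors holds (line_number, reason) pairs
    for rows that were skipped. Line numbers assume one record per line.
    """
    samples, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            sequence = (row["sequence"] or "").strip().upper()
            latitude = float(row["latitude"])
            longitude = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            errors.append((line_number, "missing or non-numeric field"))
            continue
        if not sequence or not set(sequence) <= VALID_BASES:
            errors.append((line_number, "invalid sequence"))
        elif not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
            errors.append((line_number, "coordinates out of range"))
        else:
            samples.append((sequence, latitude, longitude))
    return samples, errors
```

Each valid tuple could then be passed through the same prediction path as a single manually entered sample.
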
Potential downstream tasks include:
- Identifying invasive species
- Reclassifying wrongly classified species, e.g. a red panda is called a panda, but it's actually more genetically similar to a raccoon
# Thank You
This tool was developed as part of the GainForest EcoHackathon AI for Biodiversity Track. Thank you very much to the GainForest team and all mentors for such an engaging and fun EcoHackathon! <3 Lofi Amazon Beats
<!-- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference -->