---
title: DNA Identifier Tool
emoji: 📉
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 4.32.1
app_file: app.py
pinned: false
---

# Welcome to Lofi Amazon Rainforest Beats to Hack/AI to's DNA Identifier Tool

This tool is intended to help conservationists and biologists identify unmatched eDNA samples, or verify known samples, by predicting genus from DNA sequences. When the prediction is uncertain, the tool can also visualize the DNA embedding space to help you hypothesize about which species the sequence could belong to.

Conserving and monitoring biodiversity is crucial but challenging, especially in remote and densely vegetated areas. Current methods such as camera traps and bioacoustic monitoring require processing huge stores of video and audio feeds. Satellite imagery analysis is also difficult in areas with constant cloud cover or dense canopy, which can obscure the true conditions on the ground.

DNA is found in various states of decay in water, soil, and sediment, and it can last from a few hours in temperate waters to millennia in cold, dry permafrost. These so-called environmental DNA (eDNA) samples allow DNA to be extracted directly from the environment, without capturing or observing the organism itself. This offers a much less labor-intensive way to monitor biodiversity, provided the DNA sequences can be identified.

The current method for identifying eDNA sequences is to search for a match within a 2-3% sequence difference in a reference library of DNA sampled directly from specimens (BOLD). This library is incomplete: many species are elusive or live in inaccessible regions, which makes direct sampling infeasible.

We attempt to overcome the limitations of traditional species identification by combining eDNA with ecological layer data. We hypothesize that, besides pure DNA similarity, knowledge about the area in which a sequence was found can give clues as to what the sequence could be.
We introduce the largest DNA barcode model, trained on a global dataset of over five million sequences gathered from the Barcode of Life Data System ([Ratnasingham and Hebert, 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890991/)), and a comprehensive dataset from the Amazon rainforest, [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon), which includes DNA sequences and ecological layer data describing the coordinates where each sequence was sampled. The layers we use describe Annual Mean Air Temperature, Temperature Seasonality, Annual Precipitation, Precipitation Seasonality, Human Footprint, Elevation, and Population Density.

Our findings show that our DNA model clusters species effectively in the embedding space with modest training. Additionally, incorporating ecological layer data improves accuracy in genus classification tasks. This integration of eDNA and ecological layers offers a scalable method for advanced biodiversity analysis and conservation.

### Genus Prediction

Enter a DNA sequence and the coordinates where you sampled it. (We can easily extend this tool to handle multiple DNA sequences with CSV upload.) You can choose between two methods to predict the most probable genus:

- 'Cosine' calculates the cosine similarity between the embedding of your unidentified eDNA sequence and the embeddings of existing labelled sequences to determine the most probable genera. This method is not aware of environmental data.
- 'fine_tuned_model' outputs the predictions of a model trained on DNA embeddings and ecological layer data to determine the most probable genera.

The tool then plots the most probable genera.
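As a rough illustration of the 'Cosine' method described above, the sketch below ranks genera by cosine similarity between a query embedding and labelled reference embeddings. It is not the tool's actual code: the function name `rank_genera_by_cosine`, the choice to score each genus by its single most similar reference, the toy 3-dimensional vectors, and the genus labels are all assumptions made for illustration.

```python
import numpy as np


def rank_genera_by_cosine(query, ref_embeddings, ref_genera, top_k=3):
    """Rank genera by the best cosine similarity to the query (hypothetical helper)."""
    # Normalize to unit length so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    refs = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    sims = refs @ q

    # Assumption: a genus is scored by its single most similar reference sequence.
    best = {}
    for genus, sim in zip(ref_genera, sims):
        if genus not in best or sim > best[genus]:
            best[genus] = float(sim)

    # Highest similarity first, truncated to the top_k genera.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy embeddings for illustration only; real embeddings have many more dimensions.
ref_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
ref_genera = ["Inga", "Inga", "Cecropia", "Morpho"]  # illustrative labels
query = np.array([0.8, 0.2, 0.0])

ranking = rank_genera_by_cosine(query, ref_embeddings, ref_genera, top_k=2)
print(ranking)
```

Because this ranking depends only on the embeddings, it ignores the sampling coordinates entirely, which is the limitation the 'fine_tuned_model' option is meant to address.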
### DNA Embedding Space Visualization

Suppose the highest genus probability for a DNA sequence is very low. This could happen because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, contains no examples. In that case we can still examine the DNA embedding of the sequence in relation to known samples.

The t-SNE plot on the left shows the embedding space of the top k most common species in the dataset. Use the slider to choose k. The plot shows clear group distinctions between species. The t-SNE plot on the right shows the embeddings of the k most likely genera for the DNA sequence you provided, alongside the embedding of your sequence.

## BarcodeBERT DNA Embeddings

We use [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT) to produce the DNA embeddings. We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest), following the preprocessing steps outlined in the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).

## Classification Model Performance

We trained a single fully connected linear layer on our DNA embeddings to predict genera, both with and without ecological layer data. The model achieved a test accuracy of 82%. Our results can be validated with the [BarcodeBERT-Finetuned-Amazon](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon) model, the test split of the [BOLD-Embeddings-Ecolayers-Amazon](https://huggingface.co/datasets/LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon) dataset, and the fine-tuning script in our [repo](https://github.com/vshulev/amazon-lofi-beats/blob/master/fine_tune.py).

## Future Work and Downstream Tasks

We describe interesting avenues for future work that were out of scope due to time constraints.
Future tasks for the DNA Identifier Tool:

- The tool was intended to have two t-SNE plots: one showing the embedding space of the top k most common species in the area surrounding the given coordinates, and the other showing where the user's sequence embedding sits in that space, along with its nearest species clusters
- Include a CSV upload task to process many DNA sequences at once
- Include more ecological layers, such as layers pertaining to soil properties
- Add legends with images of each genus
- Compare more models for the genus classification task, like the [BarcodeBERT-Finetuned-Amazon-DNAOnly](https://huggingface.co/LofiAmazon/BarcodeBERT-Finetuned-Amazon-DNAOnly) model
- BOLD has other very interesting columns, such as depth, elevation, and habitat (text description), but they are extremely sparse (>90%). We could add these as optional features to our tool and update our model to handle these inputs
- Add a Retrieval Augmented Generation LLM tool that scrapes traditional/ecological knowledge about species to help predict genus

Potential downstream tasks include:

- Identifying invasive species from highly confident DNA matches of a sequence seen far from its native territory
- Reclassifying wrongly classified species, e.g. the red panda is called a panda, but it is genetically closer to raccoons than to the giant panda
- Investigating how environmental factors affect DNA sequences, e.g. mutations

# Thank You

This tool was developed as part of the GainForest EcoHackathon: AI for Biodiversity Track. Thank you very much to the GainForest team and all mentors for such an engaging and fun EcoHackathon!

<3 Lofi Amazon Beats