AI for Food Allergies

Community Article Published October 16, 2025

Let’s get straight to the point: worldwide, an estimated 220 million people suffer from at least one food allergy, and in the United States alone roughly 10% of the population is affected. This means that even if you don’t have an allergy yourself, you likely know someone who does, and it’s not a pleasant situation to be in. The condition affects not only patients’ physical health but also takes a significant toll on their mental well-being and overall quality of life.

So, what can we do about it?

In recent years, biomedical research has made several remarkable advances: from experimental vaccines and desensitization-based immunotherapies to improved diagnostic tools capable of identifying specific allergen sensitivities with unprecedented precision. These developments are pointing us in the right direction toward building long-term immune tolerance, but we’re not quite there yet.

In the meantime, we’ve also witnessed groundbreaking progress in artificial intelligence applied to biology and medicine. Models like AlphaFold and Boltz-1 have revolutionized protein structure prediction, while AI-driven approaches in genomics, drug discovery, and molecular modeling are accelerating the pace of biomedical innovation. The convergence of these worlds is opening up new possibilities for understanding, predicting, and ultimately treating complex immune conditions such as food allergies.

Figure: Four of the major allergenic proteins folded by AlphaFold. Top left to bottom right: glycinin (soybean), ovalbumin (egg), alpha-lactalbumin (milk), Ara h 2 (peanut).

Our vision with the AI for Food Allergies project is to build the first community-driven research lab dedicated to exploring how artificial intelligence can meaningfully advance the field of food allergy research. We aim to bridge the gap between cutting-edge AI and biomedical science by developing open, collaborative projects that contribute tangible value to researchers, clinicians, and patients alike.

Current State of The Art: Where AI Meets Food Allergy Research

The last couple of years have been transformative for food allergy research. Artificial intelligence, once limited to image recognition or text translation, now operates comfortably in the biological and regulatory spaces that define food safety.

This evolution began with early bioinformatics techniques that used sequence alignment and physicochemical descriptors to detect and flag potential allergens. Databases such as SDAP and AllergenOnline were used to identify cross-reactive proteins. Machine-learning tools such as AllerHunter and NetAllergen later enhanced these methods, training on thousands of known allergens and non-allergens to improve predictive accuracy.

Today, at the molecular level, deep learning models like ProtBERT, ESM-2, and AllergenBERT can analyze amino-acid sequences to predict whether a protein might act as an allergen. They identify subtle biochemical patterns, sequence motifs, secondary-structure signals, and epitope similarities that correlate with immune reactions. For example, AllergenAI applies convolutional neural networks to allergen sequences from SDAP 2.0, COMPARE, and AlgPred 2, uncovering motifs essential for IgE binding and demonstrating the promise of integrating structural data into prediction pipelines. What used to require months of lab experiments can now be screened computationally, dramatically accelerating allergen discovery in novel foods and plant-based proteins.

Concurrently, AI is expanding the scope of allergy therapeutics through advances in drug-target interaction (DTI) modelling. Deep neural networks, graph neural networks, and transformer models use chemogenomic datasets such as DAVIS and PDBbind to predict binding affinities, enabling virtual screening of compounds that could inhibit IgE–FcεRI binding or modulate inflammatory pathways. Multimodal datasets that contain molecular structures, transcriptomics, and imaging readouts can be used for tasks such as small-molecule generation, property prediction, and assessment of immune cell response. The Appendix presents key datasets supporting these AI approaches and explains how each resource can be employed in food allergy drug design.

In clinical research, AI is helping refine diagnostics. Traditionally, allergists rely on a mix of skin-prick results, serum-specific IgE levels, and patient history, but interpreting these together is difficult. Machine learning models have begun combining these modalities to estimate the true probability of a food allergy, reducing unnecessary oral food challenges and improving patient safety. Importantly, these models don’t replace doctors; they reduce uncertainty and provide interpretable probabilities rather than binary outcomes.
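
As a rough illustration of how several test results can be folded into one probability, the sketch below uses a logistic model over a skin-prick wheal size, a serum-specific IgE level, and a history flag. The feature names, weights, and intercept are hypothetical placeholders chosen only for illustration: they are not clinically derived and do not come from any published model. A real system would fit them to cohort data and check calibration before any use.

```python
import math

# Hypothetical, illustrative coefficients. NOT clinically validated.
WEIGHTS = {
    "wheal_mm": 0.30,        # skin-prick wheal diameter in millimetres
    "log10_sige": 1.20,      # log10 of serum-specific IgE (kU/L)
    "prior_reaction": 1.50,  # 1 if the history includes a reaction, else 0
}
INTERCEPT = -4.0


def allergy_probability(wheal_mm: float, sige_ku_l: float, prior_reaction: bool) -> float:
    """Return a probability in (0, 1) from a logistic combination of features."""
    features = {
        "wheal_mm": wheal_mm,
        # Floor the IgE value so that log10 is defined for zero readings.
        "log10_sige": math.log10(max(sige_ku_l, 0.01)),
        "prior_reaction": 1.0 if prior_reaction else 0.0,
    }
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    low = allergy_probability(wheal_mm=2.0, sige_ku_l=0.2, prior_reaction=False)
    high = allergy_probability(wheal_mm=10.0, sige_ku_l=20.0, prior_reaction=True)
    print(f"low-evidence case:  {low:.2f}")
    print(f"high-evidence case: {high:.2f}")
```

The point of the sketch is the output type: a graded probability that a clinician can weigh, rather than a yes/no label.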

On the consumer and regulatory side, advances in natural language processing (NLP) and computer vision (CV) have made it possible to read and understand ingredient labels at scale. NLP models trained on multilingual data can detect hidden or misspelled allergen names (“tahini” → sesame, “paneer” → dairy), while vision models can read curved, low-light packaging and extract ingredient text more reliably than standard OCR systems. Combined with live monitoring of FDA and USDA recall feeds, AI can now alert consumers to undeclared allergen risks in near real time.

Awesome Food Allergy datasets


The need for data

A fundamental step in applying Machine Learning to this field is having access to high-quality data. As highlighted by Channing and Ghosh in their position paper AI for Scientific Discovery is a Social Problem, the real challenge in ML for science goes beyond advanced models and powerful GPUs. It lies in the scarcity, fragmentation, and inaccessibility of data. This issue is particularly evident in the biomedical domain, where data gatekeeping, inconsistent standards, and lack of interoperability often hinder collaboration and slow down progress.

Collection release

The first milestone of our community is dedicated to addressing this very challenge. We have curated Awesome Food Allergy Datasets, the first open collection of datasets on food allergies, meticulously annotated and categorized to serve as a foundation for future research. By making this resource openly accessible, we aim to accelerate discovery, foster collaboration, and lower the entry barrier for researchers and innovators interested in applying AI to this critical field.

Figure: Distribution of our datasets by data type, category, and public availability.

We organize this resource into three complementary layers, each designed to serve a specific part of the AI-for-Food-Allergies ecosystem.

🧬 The Protein and Molecular Allergenicity Layer

At the molecular level, we are assembling what may become the most complete open dataset for allergen and protein analysis ever built. It merges classical allergen repositories with next-generation molecular and drug-target databases, enabling deep learning models to move seamlessly from sequence to structure to immune response.

This layer draws from trusted allergen-focused sources such as WHO/IUIS Allergen Nomenclature Database, AllergenOnline, Allergen30, AllerBase, AllFam, Allermatch, AllerHunter, AllerCatPro 2.0, AllergenAI, NetAllergen, AllerTOP v1.1, Alleropedia, Allergome, and the Allergen Family Database. These provide verified allergenic and non-allergenic protein sequences, family classifications, and cross-reactivity annotations.

To capture the biochemical and structural side of allergenicity, we integrate resources like SDAP 2.0, PDBBind+, ProPepper, and quantum-chemistry datasets including nabla²DFT, QM, QDπ, QCML, and QCDGE. These datasets provide molecular surfaces, binding affinities, and electrostatic descriptors that help AI models learn why certain proteins interact with IgE antibodies.

Because allergic response often overlaps with pharmacology, this layer also incorporates drug–target and compound databases such as DAVIS, QSAR, e-Drug3D, Stanford Drug Data, DrugCentral, MedKG, Therapeutic Target Database, STITCH, Probes & Drugs, IUPHAR Pharmacology, and Enamine REAL. These enable studies of cross-reactivity between allergens and drugs, side-effects that mimic allergic reactions, and opportunities for immunomodulatory therapy.

Each record in this unified dataset is annotated with sequence data, taxonomy, molecular descriptors, and literature references. We apply homology reduction to prevent data leakage between training and test sets and evaluate model quality using AUROC, AUPRC, MCC, and calibration scores. This layer serves as the foundation for building transformer-based models that predict allergenicity, drug–allergen interactions, and cross-reactive epitopes.
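
To make two of the evaluation metrics above concrete, here is a minimal pure-Python sketch of the Matthews correlation coefficient (MCC) and the Brier score, a common calibration measure. The labels and probabilities are made-up toy values, and using the Brier score as the calibration measure is an assumption; the article does not say which calibration score is used.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Toy data: 1 = allergen, 0 = non-allergen (illustrative only).
    y_true = [1, 1, 1, 0, 0, 0]
    y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
    print(f"MCC:   {mcc(y_true, y_pred):.3f}")
    print(f"Brier: {brier_score(y_true, y_prob):.3f}")
```

MCC is useful here because allergen datasets are often imbalanced, and it only rewards a model that does well on both classes.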


🏥 The Clinical, Immunological, and Therapeutic Layer

Allergies begin at the immune level, and understanding that requires human data. The second layer combines immunology, clinical, and trial datasets to help researchers model how allergic sensitization, tolerance, and treatment evolve over time.

From the immunological perspective, we include the IEDB (Immune Epitope Database) and its Analysis Resource, alongside specialized datasets like AlgPred 2.0, Allergen30, Allergen Peptide Browser, and ProPepper, which map B- and T-cell epitopes and antibody binding regions.

For studying patient-level outcomes, we integrate clinical and population datasets such as Food Anaphylaxis ML Dataset (TIP), Food Allergy Risk Stratification Dataset, Food Allergy & Intolerance Dataset, and AllergyMap. Large-scale cohorts like HealthNuts, CHILD, and DIABIMMUNE, plus microbiome-focused datasets (e.g., Dysfunctional Gut Microbiome Networks in Childhood IgE-Mediated Food Allergy and Akkermansia muciniphila in Fibre-Deprived Mice), enrich this layer with genetic and microbial context.

Simulated datasets — such as the Simulated Allergen Immunotherapy Trials Dataset, Simulated AIT Trials Dataset, and FARE Food Allergy Research data — allow us to model the long-term response to desensitization therapies without exposing patients to risk.

Genetic and biochemical variability is represented through GWAS, DNA Methylation GSE59999, and the Human Metabolome Database. These allow multi-omics studies of how genes, metabolism, and environment combine to shape allergic disease.

Together, these resources form the backbone for predictive models that estimate reaction risk, identify candidate biomarkers, and simulate therapy outcomes — a foundation for safer, more personalized allergy care.


🌿 The Food, Ingredient, and Regulatory Layer

The final layer connects lab science to real-world food safety. Here, we focus on datasets that describe what consumers actually eat, how products are labeled, and how authorities respond to allergen incidents.

We curate large multilingual ingredient and product databases such as Open Food Facts, Food Ingredients and Allergens, Ingredients with 16 Allergen Tags, Allergen Status of Food Products, and FSA Allergen Database Service (UK Nut Allergy Registry). Complementary regulatory datasets include Swiss Legislation on Food Allergens, COMPARE, and the FSA Allergen Database Service, which provide consistent allergen codes and labeling standards.

For real-world adverse-event tracking, we rely on CAERS (CFSAN Adverse Event Reporting System), PEAR – Partners’ Enterprise-wide Allergy Repository, and Food: Allergen and Allergy, which capture anonymized clinical reports and recall histories. Government recall sources from FDA, USDA, and CFIA, as well as global registries, are continuously ingested to monitor undeclared-allergen events and labeling failures.

These datasets feed into our Multilingual Ingredient and Label Corpus, where text data are normalized through an ontology that maps local terms (“tahini,” “gingelly,” “sesame paste”) to canonical allergens. Synthetic label images are generated to mimic supermarket conditions — glare, blur, curved surfaces, and multi-language fonts — allowing models to learn in realistic settings.
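
The term-normalization step described above can be sketched as a lookup from local ingredient names to canonical allergens. The dictionary below is a tiny hypothetical stand-in for the ontology, seeded only with the example terms mentioned in this article; the function name and cleaning rules are assumptions for illustration.

```python
import re
from typing import Set

# Toy ontology: local or alternative terms mapped to canonical allergens.
SYNONYM_TO_ALLERGEN = {
    "tahini": "sesame",
    "gingelly": "sesame",
    "sesame paste": "sesame",
    "sesame": "sesame",
    "paneer": "dairy",
}


def find_allergens(ingredient_text: str) -> Set[str]:
    """Return canonical allergens whose known terms appear in the label text."""
    # Lowercase and replace anything that is not a-z with a single space.
    # This simple cleaning drops non-ASCII letters, so a real multilingual
    # pipeline would need proper Unicode-aware normalization.
    cleaned = re.sub(r"[^a-z]+", " ", ingredient_text.lower()).strip()
    found = set()
    for term, allergen in SYNONYM_TO_ALLERGEN.items():
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(rf"\b{re.escape(term)}\b", cleaned):
            found.add(allergen)
    return found


if __name__ == "__main__":
    print(find_allergens("Chickpeas, TAHINI (12%), salt"))
    print(find_allergens("Paneer, gingelly oil"))
```

A dictionary lookup like this only covers exact known terms; handling misspellings would need fuzzy matching or a trained model on top.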

By combining structured recall data with visual and linguistic information, this layer empowers AI systems that can read packaging, understand its content, and flag inconsistencies in real time.

Accessing the collection

Our collection is available through our dedicated Hugging Face datasets repository. You can explore it interactively using the Hugging Face Space we've developed, which features name-based search along with convenient filtering options by category, task, and data type.

Figure: The interactive Hugging Face Space.

Contributing

We welcome contributions! Our datasets list is maintained on a dedicated GitHub repository where you can submit pull requests to help us grow the collection.

What’s coming next?

This first work is a testament to the fact that community-driven open science is not only possible but also a great idea. Take our case: in just a few weeks, more than 20 contributors from different backgrounds came together to work on a food allergy project. This shows how even a specialized scientific topic perceived as niche can spark genuine interest and momentum.

It’s our 0-to-1 moment: proof that when people unite around a clear purpose, even a small initiative can grow into something transformative. And who knows — maybe one day, food allergy research will have its own AlphaFold moment.

Looking ahead, our focus will shift toward hands-on, scientifically meaningful projects that build on this foundation. Guided by scientific advisors and domain experts, we aim to foster collaborative, community-driven research that advances the science of food allergies.

Our goal is to harness the power of AI to tackle key scientific questions, such as:

  • Can we enable early diagnostics to predict or detect food allergies before they develop?
  • Can AI help design more effective immunotherapies that promote long-term tolerance?
  • Is it possible to engineer new hypoallergenic foods through intelligent design?
  • And many others!

💡 Get Involved

Whether you’re a researcher, student, developer, or simply passionate about open science, we’d love to have you join us.

👉 Apply to contribute or collaborate via our short interest form

💬 Join the HuggingScience Discord community to connect, discuss ideas, and build the future of AI for food allergy research together. For any questions, you can reach out to @ludocomito, the team leader for this project.

🌐 Visit our community wiki to learn more about our initiative and keep track of active projects.

Final remarks

This first project was made possible by the coordinated effort of our contributors, showing that open science is indeed a viable path. In particular, thanks to:

  • Shreya Mishra, Aashish Anand, Dhia Naouali for processing the data and setting up the whole repository.
  • Akhil Theertala, Vaibhav Pandey, Reuben Chagas Fernandes for developing the interactive space for our collection.
  • Antonis Vozikis, Kisejjere Rashid, Vaibhav Pandey for collaborating on writing the article.

Moreover, thanks to the 20+ contributors who worked on finding and annotating the datasets for our collection.


📚 Appendix

A more detailed look at the key datasets we identified, together with some inspiration for possible food allergy applications.

SDAP 2.0: Structural Database of Allergenic Proteins

SDAP 2.0 is a web server that combines a database of allergenic proteins with computational tools for structural biology research. It lets researchers explore cross-reactivity between known allergens, screen new proteins against FAO/WHO allergenicity guidelines, and predict the IgE-binding potential of genetically modified foods. Its applications include anti-allergy drug design, protein structure analysis, epitope prediction, and cross-reactivity prediction. SDAP 2.0 contains 1657 hand-curated allergen sequences, 334 experimentally validated structures, and 1565 predicted structures, along with tools such as property distance and Cross-React for identifying IgE-binding epitopes and cross-reactive allergens (Updated Structural Database of Allergenic Proteins). The database supports hypoallergenic protein design and immunotherapy development by allowing researchers to display epitopes, align structural motifs, and screen candidate mutations. For food allergies, SDAP 2.0 can be combined with DTI datasets to model how small molecules or peptides would interfere with IgE–epitope binding. For AI researchers, SDAP’s rich dataset of allergen structures and epitopes serves as a basis for training models to predict IgE-binding sites or assess how modifications to protein structure might reduce allergenicity.
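
To give a feel for guideline-based screening, the sketch below implements one of the widely cited FAO/WHO (2001) criteria: flagging a query protein that shares a run of six identical contiguous residues with a known allergen. The other commonly cited criterion, more than 35% identity over an 80-residue window, needs a sequence alignment and is not implemented here. This is not SDAP's implementation; the criterion details are added from the guideline rather than from this article, and the sequences are made-up toy strings.

```python
from typing import Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all contiguous length-k substrings of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(query: str, allergen: str, k: int = 6) -> Set[str]:
    """Return exact k-residue stretches present in both sequences."""
    return kmers(query, k) & kmers(allergen, k)


if __name__ == "__main__":
    # Toy sequences for illustration only; not real allergens.
    known_allergen = "MKTLLVAAQCDEFGHIK"
    query_protein = "PPPQCDEFGWWW"
    hits = shared_kmers(query_protein, known_allergen)
    print("flagged" if hits else "not flagged", sorted(hits))
```

Short exact matches like this produce many false positives, which is one reason to pair them with the structure-aware tools SDAP provides.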


DAVIS: Kinase inhibitor binding affinities

The DAVIS dataset contains dissociation constants for 68 drugs against 379 protein targets. It is widely used for benchmarking drug-target interaction prediction models and can support anti-allergy drug design tasks. A review in Frontiers in Pharmacology notes that the Davis dataset provides 30,056 drug–target affinity samples with K_d values and is traditionally used to train sequence-based deep-learning models, because each pair comes with a continuous affinity measure (review of the recent advances on predicting drug target affinity). Although initially assembled for kinase inhibitors, the structure–activity pairs in the dataset can be repurposed for allergy drug discovery by focusing on inflammatory pathway targets (e.g., SYK, PI3K) and screening for molecules that block IgE signalling. Because the dataset lacks 3D structures, it is often complemented with structures from PDB or ZINC for modeling.
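
Because K_d values span several orders of magnitude, affinity models are commonly trained on a log-transformed target, pK_d = -log10(K_d in molar). The sketch below shows that conversion under the assumption that K_d is given in nanomolar. Both the unit and the transformation are conventions from the drug-target affinity literature, not something this article specifies.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant in nanomolar to pK_d.

    pK_d = -log10(K_d [M]) = 9 - log10(K_d [nM]); higher means tighter binding.
    """
    if kd_nm <= 0:
        raise ValueError("K_d must be positive")
    return 9.0 - math.log10(kd_nm)


if __name__ == "__main__":
    for kd in (1.0, 100.0, 10000.0):
        print(f"K_d = {kd:>8.1f} nM  ->  pK_d = {kd_nm_to_pkd(kd):.2f}")
```

On this scale a tenfold improvement in binding adds exactly one unit, which makes regression targets easier to compare across compounds.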


QsarDB: repository for (Q)SAR models

QsarDB is a smart repository that holds quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR) models along with their datasets. It provides access to peer-reviewed, open (Q)SAR models that can support anti-allergy drug design. The repository stores models in a content-aware form and offers facilities for analysis, visualization, and prediction (QsarDB). For allergy research, annotated models in QsarDB allow rapid estimation of physicochemical parameters (e.g., lipophilicity, solubility) or biological activity of candidate molecules, enabling triage before running DTI simulations. The repository emphasizes transparency and reproducibility: each model comes with documentation and citations, which is important when regulatory agencies require substantial evidence for novel allergy therapeutics.


e-Drug3D Database

e-Drug3D is a three-dimensional database of drug-like molecules and their molecular conformations. It provides structural data for structure-based drug design and is utilized for anti-allergy drug design and DTI applications. According to the official website and an ACS Med Chem Lett paper, e-Drug3D contains over 2,162 structures of FDA-approved compounds with molecular weights ≤2000, including pharmacokinetic and pharmacodynamic information such as volume of distribution, clearance and half-life (e-Drug3D, Datasets FDA Approved Drugs). By offering conformers for approved drugs and active metabolites, e-Drug3D allows virtual screening for drug repurposing. For food allergy, researchers can search for molecules that are inhibitors of histamine release or blockers of allergic signal transduction and screen off-target effects with structural similarity.


Stanford Drug Data: Offsides/Twosides

This dataset includes the Offsides database of individual drug side-effects and the Twosides database of side-effects of drug–drug interactions. The Science Translational Medicine article describing it notes that the authors built a large database of drug effects and of drug–drug interaction side-effects, making it possible to correct for confounding factors in adverse event reports (Data-Driven Prediction of Drug Effects and Interactions). By integrating this dataset with allergy-focused data, AI models can predict possible side-effects or cross-reactive adverse effects in patients taking antihistamines or biologics and identify interactions that exacerbate food allergy.


DrugCentral: open drug information repository

DrugCentral is an open-source, web-based repository of over 4,950 drugs, including structural, physicochemical, and pharmacological data. It supports anti-allergy drug design and DTI tasks. A 2023 review reports that DrugCentral covers ~20,000 bioactivity data points, 724 mechanism-of-action targets, >14,300 on- and off-label indications, 27,000 contraindications, and ~340,000 adverse drug events (DrugCentral). It links every drug to curated mechanism-of-action targets and approved indications. For AI researchers working on allergy, DrugCentral provides a ready catalog of the drugs used in allergic conditions (e.g., epinephrine, antihistamines, corticosteroids, leukotriene modifiers, monoclonal antibodies like omalizumab) and their target profiles. In practice, a pipeline can simply query which approved drugs target IgE, histamine, or IL-5 and use the result as curated training data. As an example, to train a model to predict new antihistamines, one can fetch all histamine H1 receptor antagonists in DrugCentral to build a positive set. DrugCentral data can also facilitate side-effect prediction models: knowledge of the polypharmacology of allergy drugs (most antihistamines also bind off-target receptors) can allow AI to predict adverse effects or cross-reactivity. Moreover, DrugCentral is widely used for training natural language processing models for pharmacology, as it contains text descriptions of drug indications and effects. It is possible to fine-tune an NLP model on DrugCentral entries to, for example, summarize the potential action of a new compound or translate between chemical structure and described effect (a kind of "Chemo-BERT" for drug repurposing).


MedKG: medical knowledge graph

MedKG is a comprehensive medical knowledge graph that integrates data from 35 expert sources, with 34 node types and 79 relations. Its authors describe it as an integrative biomedical knowledge graph with continuous integration and update processes (MedKG). In other words, MedKG is a network of biomedical knowledge that algorithms can mine directly. For food allergy, a knowledge graph is especially valuable for uncovering latent relationships: for example, from an allergen protein in food to the gene encoding it, to a pathway that it activates, to known drugs targeting that pathway. AI inference over MedKG could propose drug repurposing candidates, perhaps finding that a drug that treats an autoimmune disease by targeting a given interleukin could also relieve food allergy symptoms driven by the same interleukin. MedKG could further incorporate patient data (if integrated with EMR sources), which would allow AI models to predict allergy risk or treatment efficacy using graph algorithms. Because MedKG contains many types of data, one real-world use is feeding it to graph neural networks (GNNs) or knowledge graph embedding models for link prediction, such as predicting that Drug X can treat peanut allergy based on its neighborhood in the graph. Furthermore, the built-in molecular embeddings within MedKG help correlate chemical space with biological effect, which is extremely useful in allergy drug design, where we want to go from a target (e.g., IgE) to "find me a molecule that binds here and does not have toxicity." In summary, MedKG is a cutting-edge, AI-friendly biomedical knowledge graph that offers an integrated view, enabling advanced machine learning algorithms to generate knowledge for allergy therapeutics and personalized medicine.
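
The multi-hop reasoning described above can be illustrated with a toy graph and a breadth-first search that finds drugs within a few hops of an allergen. This is a much simpler stand-in for GNN or embedding-based link prediction, and the triples, relation names, and node set below are hypothetical: they do not reflect MedKG's actual schema or content.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy triples (head, relation, tail); not taken from MedKG.
TRIPLES = [
    ("Ara h 2", "binds", "IgE"),
    ("IgE", "binds", "FcεRI"),
    ("omalizumab", "targets", "IgE"),
    ("cetirizine", "targets", "histamine H1 receptor"),
]
DRUGS = {"omalizumab", "cetirizine"}


def build_adjacency(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map, ignoring relation types."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def drugs_within(start: str, max_hops: int, adjacency: Dict[str, Set[str]],
                 drugs: Set[str]) -> Dict[str, int]:
    """Return {drug: hop distance} for drugs reachable within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, set()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {node: hops for node, hops in distance.items() if node in drugs}


if __name__ == "__main__":
    graph = build_adjacency(TRIPLES)
    print(drugs_within("Ara h 2", max_hops=2, adjacency=graph, drugs=DRUGS))
```

In this toy graph the allergen reaches one drug through a shared IgE node, which is the kind of path a learned link predictor would be expected to score highly.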


PDBBind+: protein-ligand binding database

PDBBind+ refers to an enhanced, “leak-proof” reorganization of the PDBBind dataset that improves its quality and splitting methodology (PDBBind+). It offers a carefully chosen collection of protein–ligand pairs with known affinities, arranged so that the training and test sets exhibit no significant overlap in either protein or ligand similarity, which is vital for building robust AI models. For allergy research, PDBBind/PDBBind+ is a foundation for drug–target affinity prediction models. A number of targets relevant to allergic disease (e.g., inflammatory mediator receptors and arachidonic acid metabolism enzymes) have structures in the Protein Data Bank and corresponding ligands in PDBBind (PDBBind). By training on PDBBind+, AI models can learn to predict how tightly a small molecule will bind to a protein, which is a critical factor in designing drugs to block allergic pathways. For example, to find new inhibitors of mast cell tryptase (an enzyme involved in allergic reactions), a model pretrained on general binding data can be fine-tuned on whatever tryptase–inhibitor data exist and then used to virtually screen compounds. Further, PDBBind's focus on 3D structure is complementary to ligand-based datasets like DAVIS: combined, they can give better structure-based AI predictions. The "Plus" version's focus on data quality and unbiased evaluation ultimately makes AI predictions more reliable, a necessity when translating into real-world drug discovery for allergies.
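
One way to picture a ligand-side leakage check is to compare fingerprints between the training and test sets with Tanimoto similarity and flag test ligands that are near-duplicates of training ligands. The sketch below assumes fingerprints are already available as sets of "on" bit indices. The ligand names, bit sets, and the 0.9 threshold are hypothetical, and this is not the actual PDBBind+ procedure.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaked_test_items(train: Dict[str, Set[int]], test: Dict[str, Set[int]],
                      threshold: float = 0.9) -> List[str]:
    """Return test ligand IDs that are too similar to any training ligand."""
    return [
        test_id
        for test_id, test_bits in test.items()
        if any(tanimoto(test_bits, train_bits) >= threshold
               for train_bits in train.values())
    ]


if __name__ == "__main__":
    # Toy fingerprints (hypothetical bit indices).
    train_set = {"lig_a": {1, 2, 3, 4}, "lig_b": {10, 11, 12}}
    test_set = {"lig_c": {1, 2, 3, 4}, "lig_d": {20, 21}}
    print(leaked_test_items(train_set, test_set))
```

A full leak-proof split would apply the same idea on the protein side as well, for example with sequence identity between targets.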


Human Metabolome Database (HMDB)

The HMDB is a comprehensive database of human small-molecule metabolites, with over 220,000 metabolite records (HMDB). It includes detailed chemical data, clinical information (normal and abnormal concentration ranges in biofluids), and references to enzymatic pathways for each metabolite. Although not an allergy database per se, HMDB is highly valuable for studying allergy from the perspective of biomarker discovery and mechanistic insight. Allergic reactions involve the release or consumption of many metabolites: for example, histamine (a biogenic amine) is a major mediator released from mast cells, and lipid mediators like prostaglandins and leukotrienes (also in HMDB) are produced during allergic reactions. AI algorithms can use HMDB to identify metabolic signatures of allergy: by applying machine learning to metabolomic profiles of allergic patients versus controls, it may be feasible to find a set of metabolites that signals an impending anaphylactic event or quantifies the severity of an allergic reaction. HMDB would provide the reference for what those metabolites are and in what biochemical context they occur. Metabolism also matters during drug design, since many allergy drugs (e.g., corticosteroids, leukotriene modifiers) have active or inactive metabolites. An AI drug metabolism prediction model could use HMDB's data to predict whether a new anti-allergy molecule will be metabolized into toxic products and how it would be excreted. Because HMDB links metabolites to pathways and enzymes, AI can also use it when examining gut microbiome activity or dietary interventions in allergy, relating diet-derived or microbial metabolites (like short-chain fatty acids) to immune modulation.
In essence, HMDB offers the chemical and clinical context in which to view the metabolic component of allergy, so that AI models can connect proteins and genes to the universe of small molecules that ultimately drive or signal allergic disease.


Therapeutic Target Database

TTD is a comprehensive database of discovered and documented therapeutic targets, linked to the drugs that target them and the diseases they are associated with (TTD). In addition to protein targets (enzymes, cytokines, receptors, etc.), TTD also includes nucleic acid targets, along with their pathways and other annotations. For food allergy and asthma (both clinical manifestations of allergic disease), TTD is a handy information base: it lists targets like IgE, IgE receptors, IL-5, IL-13, TSLP, CRTH2, and other immune molecules being targeted for allergy treatment. For AI, TTD can be used to define task objectives and construct training data. For instance, one can query TTD for all targets associated with "Asthma" or "Allergic rhinitis", and it will return a list of studied or validated targets and known ligands or drugs. This can guide the construction of datasets for drug–target interaction prediction focused specifically on allergy. Additionally, TTD has information on the drugs themselves (clinical status, mechanism, etc.), so a model could be trained to predict drug efficacy or development phase as a function of target attributes, which supports drug repurposing. More broadly, TTD's curated set of target–disease–drug associations is fertile ground for building knowledge graphs or training models that reason over biological networks. For example, a knowledge graph AI can use TTD data to identify connections between food allergy and existing drugs that already address a relevant molecular pathway and suggest off-label applications. In short, TTD bridges the gap between molecular target and clinical outcome and helps AI identify where and how to disrupt the allergic process and with what agents.


Therapeutic Data Commons (TDC)

TDC is an initiative that provides a variety of benchmark datasets and tasks, standardized to be AI-ready, across the drug discovery and development pipeline (TDC). It spans over 20 categories of learning tasks, from QSAR property prediction to DTI prediction to drug–drug interaction to clinical outcome modeling, and for each task it collects benchmark datasets with reproducible splits, evaluation measures, and baseline results. The advantage of TDC for AI-driven allergy drug design is twofold: (a) it comes with pre-prepared data suited to what we want to do (e.g., ADMET data that can help check that a new allergy drug is not highly toxic, or DTI data for binding affinity such as DAVIS, discussed above, and KIBA); (b) it gives a framework for testing models on these tasks in a fair manner. Using TDC, a researcher can readily determine which models are best at, for instance, predicting binding to a particular allergy-related target, or which model most accurately predicts a compound's side-effect profile (some side effects, such as drowsiness, are important in allergy medication). In addition, TDC is continually expanding (now into multimodal and generative tasks), in line with the trend of combining different forms of data (e.g., chemical data with cell images or text). For instance, an Early Detection of Allergies initiative might bring in patient health records (if available): TDC enables importing such clinical datasets and evaluating predictive models (for instance, predicting who will develop severe food allergies based on their medical history, much like the risk stratification dataset mentioned in the clinical layer). On the whole, TDC does not contribute new domain-specific information; rather, it helps make the most effective use of existing data.
By using TDC's benchmarks and tasks, our AI allergy models can be rigorously trained and tested, so that when we claim a new model finds, for example, better drug candidates or better allergy predictions, the claim rests on rigorous comparative evaluation.


STITCH: Chemical–Protein Interaction Database

STITCH is a database that integrates known and predicted interactions between proteins and small molecules (including drugs). It draws evidence from diverse sources: experimental data, metabolic pathway databases and binding assays, text mining of the literature, and computational predictions. Essentially, STITCH can be thought of as a large network whose nodes are proteins and chemicals and whose edges represent an interaction or binding relationship with some degree of confidence. For AI use, STITCH offers a precomputed resource for training drug–target interaction (DTI) models or performing network-based inference. In allergy research, STITCH can help identify which food chemicals or additives interact with human proteins (e.g., do some food chemicals interact with immune receptors and act as adjuvants in allergy?). It can also list all the proteins that an anti-allergy drug binds to, which is useful for polypharmacology modeling. An AI system might use STITCH data to predict novel interactions, for example that a flavonoid from a food crop could bind and block IgE or mast cell receptors and thus be a potential allergy drug. The inclusion of text-mined data gives extensive coverage, including anecdotal or less-documented interactions that could fall through the cracks of other curated databases. Graph-based AI algorithms work well with such dense relationship data. We can train a graph neural network on the STITCH network to predict new edges (interactions), for example which existing drugs might interact with the peanut allergen Ara h 2 (if we add allergens as proteins in the network) to inhibit its IgE binding. While not an allergy-specific tool, STITCH is essential for systems pharmacology approaches, and its integration ensures that our AI platform can consider the global interaction network when addressing allergy drug design.


M3-20M Multi-Modal Molecule Dataset

M3-20M is a very large open-access dataset of 20 million molecules, designed to support AI-driven drug design with a multi-modal approach (M3-20M). Each molecule in M3-20M is provided with multiple representations: its 1D SMILES string, 2D graph structure, 3D conformation, a set of computed physicochemical properties, and even a textual description generated to summarize the molecule’s features. The integration of these modalities (chemical, structural, and linguistic) offers a rich playground for modern AI models (like graph neural networks and transformer-based models) to learn chemical concepts. For allergy drug design, a dataset like M3-20M can be invaluable for training generative models that propose new compounds or predictive models that estimate properties (e.g., oral bioavailability or toxicity) of candidate anti-allergy drugs. Since it’s multi-modal, one interesting use could be training a model that, given a desired function (like “histamine H1 receptor antagonist” or “mast cell stabilizer”), generates a molecular description and structure that fit the profile – effectively bridging natural language and chemical design. The sheer scale (20 million compounds) also means that a model can be exposed to a wide chemical space, including many drug-like and lead-like molecules. This improves the chances of discovering novel molecules that could serve as next-generation allergy therapeutics. In summary, M3-20M is a cutting-edge resource that pushes the boundaries of how AI can learn from big chemical data, directly benefiting the search for safe and effective anti-allergic compounds.


SAIR (Structurally Augmented IC50 Repository)

SAIR is a recently released large dataset built to accelerate AI in drug discovery, and it is particularly promising for allergy therapeutics (SAIR). SAIR consists of over 1 million protein–ligand pairs with experimentally measured binding affinities and 5.2 million 3D co-folded structures of the protein–ligand complexes. That is, each protein target in the dataset is paired with numerous small molecules of known potency, each modeled in complex with it, providing a rich structural training set for deep learning models. For allergies, SAIR includes many protein targets of allergic disease (immunological enzymes, mast cell or basophil receptors, cytokines, etc.) and molecules that modulate them. Machine learning models trained on SAIR can learn to predict the affinity with which a given molecule binds to a target, making them useful for virtual screening of candidate anti-allergy drugs. For instance, one may train a model on SAIR to find novel high-affinity blockers of the IgE–FcεRI interaction or inhibitors of key cytokines (e.g., IL-4 or TSLP) in allergic inflammation. The size of SAIR's structure–activity data (with millions of examples) also allows for the training of structure-aware AI models with better generalization. By spanning a vast chemical space and many protein conformations, SAIR allows AI models to predict binding more accurately, even for new or slightly different targets. This makes it a very valuable resource for the design of small-molecule therapeutics in food allergy (e.g., mast cell stabilizers or IgE-neutralizing agents).
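Potency measurements such as IC50 span many orders of magnitude, so affinity models are commonly trained on a log-scaled target instead of the raw value. The sketch below assumes potency is given as an IC50 in nanomolar and converts it to pIC50 = -log10(IC50 in molar). The ligand names and values are made up for illustration and do not come from SAIR.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


# Hypothetical ligands against a single target (illustrative values only).
measurements_nm = {"ligand_a": 1.0, "ligand_b": 250.0, "ligand_c": 1000.0}

# Higher pIC50 means a more potent compound.
ranked = sorted(measurements_nm, key=lambda name: -pic50_from_nm(measurements_nm[name]))
for name in ranked:
    print(name, round(pic50_from_nm(measurements_nm[name]), 2))
```

A regression model trained on such targets predicts pIC50 directly, and candidates can then be ranked by predicted value during virtual screening.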


AllerBase

AllerBase is a database of allergenic proteins and their properties, built to integrate data from diverse sources (e.g., IUIS allergen listings, Allergome, and literature) with stringent experimental validation (AllerBase). It houses comprehensive entries for known allergens and includes an extensive collection of validated IgE epitopes (over 1,100 IgE-binding peptide sequences from 117 allergens). This is a valuable asset for allergenicity prediction models: AllerBase's positive (allergen) and negative (non-allergen) examples can be used to train machine learning algorithms to identify proteins that induce IgE responses. The epitope data further allows AI to determine which regions of an allergen are immunoreactive, informing the design of hypoallergenic protein variants (by modifying or removing key epitopes). In summary, AllerBase provides the ground-truth allergen information behind many AI classification and epitope-mapping tools in allergy research.


AlgPred 2.0 Dataset

The AlgPred 2.0 dataset and web server are landmarks in allergen prediction using machine learning (AlgPred 2.0). The dataset contains 10,075 experimentally confirmed allergen sequences and an equal number of non-allergens, together with 10,451 experimentally confirmed IgE epitopes, for training models. From this data, AlgPred 2.0 trains ensemble classifiers that combine BLAST similarity, epitope mapping, motif discovery, and machine learning to achieve high accuracy (AUC ~0.98) in discriminating allergens from non-allergens. From a practical perspective, this dataset is an AI goldmine: models trained on it can predict whether a novel protein (e.g., a novel food protein or biopharmaceutical) might be allergenic, guiding safer design. Because known IgE epitope sites are included, AI can also highlight which regions of a protein are problematic, allowing bioengineers to edit those regions out. AlgPred 2.0 demonstrates how painstakingly curated allergen vs. non-allergen datasets, when fed into modern algorithms, greatly enhance our ability to screen for allergenicity risk in silico.
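To make the AUC figure concrete: the area under the ROC curve equals the probability that a randomly chosen allergen receives a higher score than a randomly chosen non-allergen. The following sketch computes it directly from that definition. The labels and scores are invented for illustration and are not AlgPred 2.0 outputs.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs: 1 = allergen, 0 = non-allergen.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(f"AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a small check; library implementations use a sort-based computation for large datasets.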


AllerCatPro 2.0

AllerCatPro 2.0 is a novel protein allergenicity prediction tool that is the first to combine sequence similarity and structure features (AllerCatPro 2.0). It is built on what is described as the most comprehensive dataset of allergenic proteins: 4,979 protein allergens, 162 low-allergenic proteins, and 165 autoimmune-allergen proteins, all strictly curated from authoritative databases (WHO/IUIS allergen list, COMPARE, FARRP, Allergome, etc.). AllerCatPro 2.0 uses this dataset to predict an input protein's allergenic potential by comparing it with known allergens on the basis of sequence motifs and 3D epitope surfaces. For AI in allergy drug design, AllerCatPro 2.0's dataset and approach illustrate the power of multi-modal learning: models that consider a protein's 3D structure in addition to its sequence can identify allergenic proteins more precisely (and rule out truly non-allergenic ones with greater confidence). This is important for the development of therapeutic proteins or novel enzymes for food processing, where AI algorithms can check whether a designed protein inadvertently resembles the structure of a known allergen. Overall, AllerCatPro 2.0 provides both a thorough allergen dataset and an example of an AI-based solution used in allergenicity risk assessment.


AllergenAI

AllergenAI is a platform and deep learning model developed to predict, from the amino acid sequence of a protein alone, its potential to be an allergen (AllergenAI). The developers of AllergenAI collated and preprocessed training data from three large allergen databases – SDAP 2.0, COMPARE, and AlgPred 2.0 – thereby using thousands of sequences of known allergens and non-allergens as input for a convolutional neural network. By learning directly from sequence patterns, AllergenAI can recognize proteins as allergenic without relying on external features. This sequence-only approach is especially useful for screening proteomes (e.g., proteins of a novel plant or novel protein sources like insects or lab-grown foods) for allergenic hits. The model was also used to identify new potential allergens (e.g., high-risk proteins in foods like date palm or spinach that had not previously been identified as allergens). For drug design and allergy therapy, the significance of AllergenAI lies in its ability to guide protein engineering: one can rapidly try amino acid substitutions and receive an AI-predicted allergenicity score, which can steer the development of vaccine candidates or hypoallergenic variants with minimal IgE binding.
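The substitution workflow described above can be sketched as a simple in silico mutation scan: enumerate every single-residue substitution, score each variant, and keep the lowest-scoring ones. In the snippet below, `toy_score` is a deliberately meaningless placeholder (the fraction of cysteine residues) that stands in for a real predictor. It is not AllergenAI's model or API, and the function names and the example sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_substitutions(sequence):
    """Yield (position, original, replacement, mutated_sequence) for every point mutant."""
    for i, original in enumerate(sequence):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                yield i, original, replacement, sequence[:i] + replacement + sequence[i + 1:]


def toy_score(sequence):
    """Placeholder scorer (NOT a real allergenicity model): fraction of cysteines."""
    return sequence.count("C") / len(sequence)


def best_variants(sequence, score_fn, top_k=3):
    """Return the top_k point mutants with the lowest predicted score."""
    scored = [
        (score_fn(mutant), f"{orig}{pos + 1}{repl}")  # label such as C2A, 1-based position
        for pos, orig, repl, mutant in single_substitutions(sequence)
    ]
    return sorted(scored)[:top_k]


for score, label in best_variants("ACCK", toy_score):
    print(label, score)
```

With a real predictor plugged in as `score_fn`, the same loop would rank candidate substitutions, which would still need checks for folding and function before any experimental follow-up.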


NetAllergen

NetAllergen-1.0 is a more recent machine learning pipeline (random forest-based) that integrates immunological context into allergen prediction (NetAllergen-1.0). It was built by first collecting a filtered dataset of IgE-binding allergens from AllergenOnline (the official repository of IgE-inducing allergens) and then removing redundancy to obtain a clean dataset. Most notably, NetAllergen includes a novel feature for each protein: its computationally predicted MHC class II presentation propensity (MHC-II presentation is a critical step in the activation of T cells in allergy). Combining traditional features (motifs, sequence similarity, etc.) with MHC-II presentation scores helped NetAllergen achieve improved accuracy, especially on allergens with low sequence similarity to established allergens. This approach of including immune processing data is very relevant to AI drug design for allergy. It suggests that models can take into account not just whether a protein is similar to a known allergen, but also whether it would be seen by the immune system (through antigen presentation). The high-quality dataset behind NetAllergen (constructed from AllergenOnline and filtered for duplicates) provides a gold standard for developing next-generation allergen predictors. Overall, NetAllergen demonstrates how AI models can be built by incorporating immunological knowledge, paving the way for the design of proteins or peptides that avoid triggering T-cell and IgE responses.
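The redundancy-removal step can be illustrated with a greedy filter: walk through the sequences and keep one only if it is sufficiently dissimilar to everything already kept. The sketch below uses Jaccard similarity over 3-mers as a cheap, alignment-free proxy for sequence similarity. The similarity measure, the 0.8 threshold, and the toy sequences are all assumptions for illustration, not NetAllergen's actual procedure.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_redundant(sequences, threshold=0.8, k=3):
    """Greedy filter: keep a sequence only if it is below `threshold`
    similarity to every sequence kept so far (input order decides ties)."""
    kept = []
    for name, seq in sequences:
        profile = kmers(seq, k)
        if all(jaccard(profile, kmers(other, k)) < threshold for _, other in kept):
            kept.append((name, seq))
    return kept


# Toy sequences (not real allergens): one exact duplicate, one unrelated entry.
dataset = [
    ("prot_a", "MKTLLVAAALS"),
    ("prot_a_copy", "MKTLLVAAALS"),
    ("prot_b", "GSHMWQERTYP"),
]
print([name for name, _ in remove_redundant(dataset)])
```

Production pipelines typically use alignment-based clustering tools for this step. The point here is only the greedy keep-or-drop logic, which prevents near-identical proteins from inflating test performance.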


QM9: Molecular Property Prediction Dataset for Quantum Chemistry

QM9 is a standard dataset in molecular machine learning and contains approximately 134,000 small organic molecules with high-accuracy quantum-mechanical properties. Each molecule comes with a 3D geometry and 13 computed physical and electronic properties such as dipole moment, isotropic polarizability, HOMO/LUMO energies, enthalpy, and free energy (QM9). Molecules in the dataset are drawn from the GDB-17 chemical universe and are drug-like and chemically diverse, making QM9 a representative benchmark for graph neural networks, transformers, and equivariant models in chemistry.
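As a small worked example of a derived quantum property, the HOMO–LUMO gap is simply the LUMO energy minus the HOMO energy. The sketch below assumes both energies are given in Hartree and converts the gap to electronvolts. The orbital energies shown are illustrative numbers, not values taken from QM9.

```python
HARTREE_TO_EV = 27.211386  # 1 Hartree expressed in electronvolts


def homo_lumo_gap_ev(homo_hartree: float, lumo_hartree: float) -> float:
    """Return the HOMO-LUMO gap in eV from orbital energies in Hartree."""
    return (lumo_hartree - homo_hartree) * HARTREE_TO_EV


# Illustrative orbital energies in Hartree (not a real QM9 entry).
gap = homo_lumo_gap_ev(homo_hartree=-0.25, lumo_hartree=0.05)
print(f"gap = {gap:.2f} eV")
```

Checking units in this way is worthwhile before training, since mixing Hartree and eV targets is an easy way to corrupt a regression task.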

In AI-based food allergy drug design, QM9 serves as a foundation for learning quantum-accurate molecular representations, which can then be fine-tuned on domain-specific drug–target datasets such as DAVIS, PDBBind+, and SAIR. By learning the inherent relationships between molecular structure and physicochemical properties, QM9-trained models can help estimate the stability, solubility, reactivity, and binding potential of novel anti-allergy compounds. As an example, quantum-level properties from QM9 can guide AI models in predicting small molecules that inhibit IgE–FcεRI binding with favorable energetic and pharmacokinetic properties.

In addition, QM9 is an important pretraining dataset for generative AI models that create chemically reasonable, low-energy molecules of relevance to allergy therapeutics. The quantum properties in the dataset impose physical plausibility constraints on generated compounds, so that virtual screening and molecular optimization pipelines stay chemically realistic. Thus, although QM9 is not directly aimed at allergy, it forms the backbone of the molecular intelligence that new AI systems draw on when designing safe and effective allergy medications.
