article = """
<img src="https://www.iic.uam.es/wp-content/uploads/2017/12/IIC_logoP.png">
<p style="text-align: justify;"> This app was developed by the aforementioned members of <a href="https://www.iic.uam.es/">IIC - Instituto de Ingeniería del Conocimiento</a> as part of the <a href="https://somosnlp.org/hackathon">Somos NLP Hackathon 2022</a>.
<h3 style="font-family: Georgia, serif;">
Objectives and Motivation
</h3>
It has recently been shown that biomedical research is essential for the sustainability of society. There is so much information on the Internet about this topic that
we thought it would be possible to build a large database of biomedical texts, retrieve from it the most relevant documents for a given question and, with all that information, generate a concise, self-explanatory answer that conveys what the documents say.
Biomedical researchers and professionals could use such a tool to quickly identify the key points for answering a question, thereby accelerating their research. We would also put important health-related information in everyone's hands, which we think can have
a very positive impact on society. Health is a hot topic today, but it should always be among our top priorities; providing quick and easy access to understandable answers that turn complex information into simple explanations is, in our opinion, a step in the right direction.
We identified the need for strong, intelligent information retrieval systems. Imagine a Siri that could generate coherent answers to your questions instead of simply running a Google search for you. That is the technology we envision, and we would like the Spanish NLP community
to get one small step closer to it. The Somos NLP Hackathon 2022 is in fact intended to boost NLP tools in Spanish, as there is an imbalance between the number of Spanish speakers and the share of Spanish models and datasets on the Hugging Face Hub.
The main technical objective of this app is to expand the existing tools for long-form question answering in Spanish by introducing new generative methods together with a complete architecture of well-performing models, which produced interesting results on the variety of examples we tried.
In fact, several methods that are new for Spanish were introduced to build this app.
Most existing question answering systems rely on Sentence Transformers for passage retrieval (which we wanted to improve on by building Dense Passage Retrieval models in Spanish) and use extractive question answering methods. This means that the user has to read
through the top answers and mentally assemble a final answer that combines all of that information. To the best of our knowledge, this is the first time Dense Passage Retrieval models have been trained in Spanish on large datasets, and the first time a generative question answering model in Spanish
has been released.
The first obstacle we found was the scarcity of datasets for this task, made worse by the gap between general-purpose data and the biomedical domain. We overcame it by using translation models from Transformers (specified in each dataset) to translate BioAsq
into Spanish, and by doing the same with LFQA (more info in the linked datasets). BioAsq is a large English question answering dataset for the biomedical domain, containing more than 35k question-answer-context triplets for training. We then used our translated version of BioAsq
together with SQAC (15k triplets) and SQUAD-ES (87.5k training triplets), which also has a portion related to the biomedical domain. This was very useful for training extractive QA models to share with the community (you can find some at https://huggingface.co/IIC),
and also for building a dataset to train a Dense Passage Retrieval (DPR) model. DPR is key for our app: without near-perfect information for answering a question, the generative model will not produce a reliable answer.
The fragility of the solution we devised, which is also its most beautiful side when it works, is that every piece must work well for the final answer to be correct. If our Speech2Text system is not
good enough, the transcribed text reaches the DPR corrupted, no relevant documents are retrieved, and the answer is poor. Similarly, if the DPR is not trained correctly and cannot identify the relevant passages for a query, the result will be bad.
This also served as motivation, since the technical difficulty would be entirely worth it if the system worked. Moreover, it would be our contribution to the Spanish NLP community. To build this app we drew on much of what we learned in the private sector about building systems
that rely on multiple models, in order to deliver top-performing models for question answering tasks to the community, thus taking part in open source culture and the spread of knowledge. Another objective, then, was to give a practical example of good practices,
which fits the didactic character of both the organization and the Hackathon.
Regarding Speech2Text, there were existing solutions trained on Common Voice; however, there were no Spanish models trained on larger datasets such as MultiLibrispeech-es, which we used following the results reported in Meta's paper (more info in the wav2vec2 model linked below). We also decided
to train the large version of wav2vec2, since the other available ASR models were 300M-parameter models and we wanted to improve on model size as well as on the dataset used. We obtained a WER of 0.073, which is arguably low compared to the other existing models on Spanish ASR
datasets. Further research is needed to compare all of these models, but that was outside the scope of this project.
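For readers unfamiliar with the metric, here is a minimal sketch of how word error rate (WER) is usually computed: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The function name and the example sentences are ours, for illustration only; this is not the evaluation script used for the model.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed here as the word-level Levenshtein distance.
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the distance between the processed prefix of ref and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One wrong word out of four reference words
print(word_error_rate("el gato come pescado", "el gato come pollo"))  # 0.25
```

Read this way, a WER of 0.073 corresponds to roughly 7 word errors per 100 reference words.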
Another contribution we wanted to make with this project was a well-performing ranker in Spanish. This is a piece we place after the DPR to rank the retrieved passages by relevance to the query and select the top ones. Although there are multilingual open source solutions, there are no monolingual Spanish models for this purpose.
To that end we trained a CrossEncoder, for which we automatically translated <a href="https://microsoft.github.io/msmarco/">MS Marco</a> with Transformers. The resulting dataset has around 200k query-passage pairs when using the ratio of 1 positive to 4 negatives taken from the papers. MS Marco is the dataset typically used in English to train cross-encoders for ranking.
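To make the 1 positive to 4 negatives ratio concrete, the sketch below builds labeled query-passage pairs for a ranker. Everything in it is illustrative: the function name, the input layout and, in particular, the choice of sampling negatives uniformly at random are assumptions of ours, not a description of how the MS Marco pairs were actually selected.

```python
import random
from typing import Dict, List, Tuple


def build_ranking_pairs(
    positives: Dict[str, str],
    passage_pool: List[str],
    negatives_per_positive: int = 4,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    # Returns (query, passage, label) triples: label 1 for the relevant passage,
    # label 0 for each sampled negative.
    rng = random.Random(seed)
    pairs: List[Tuple[str, str, int]] = []
    for query, positive in positives.items():
        pairs.append((query, positive, 1))
        candidates = [p for p in passage_pool if p != positive]
        k = min(negatives_per_positive, len(candidates))
        # Assumption: negatives are drawn uniformly at random from the pool
        for negative in rng.sample(candidates, k):
            pairs.append((query, negative, 0))
    return pairs
```

With this ratio each query contributes five training pairs, one labeled 1 and four labeled 0.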
Finally, there are no generative question answering datasets in Spanish. For that reason we used LFQA, as mentioned above. It has over 400k instances, which we also translated with Transformers.
Our translation pipeline had to be both robust and efficient, since the passages were too long for the maximum sequence length of the translation model and there were 400k x 3 (answer, question, passages) texts to translate.
We solved those problems by splitting the texts intelligently and reconstructing them after translation, and by configuring the translation process efficiently. Thanks to this dataset we were able to train two generative models, drawing on our experience with generative language models to train them effectively.
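As a rough illustration of splitting and reconstruction, the sketch below packs whole sentences into chunks that stay under a size limit, translates all chunks in one batched call, and joins the results back into a single passage. The helper names are hypothetical, the translate argument stands in for whatever translation model is used, and counting words is only a stand-in for the model's real token limit; our actual pipeline may differ.

```python
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: a sentence ends at a word finishing in ".", "?" or "!"
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith((".", "?", "!")):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def chunk_sentences(sentences: List[str], max_words: int) -> List[str]:
    # Greedily pack whole sentences into chunks of at most max_words words.
    # A single sentence longer than max_words becomes its own chunk.
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_long_text(
    text: str,
    translate: Callable[[List[str]], List[str]],
    max_words: int = 100,
) -> str:
    # Split, translate every chunk in one batched call, then rebuild the passage
    chunks = chunk_sentences(split_sentences(text), max_words)
    return " ".join(translate(chunks))
```

Keeping sentences whole avoids cutting the input mid-sentence, and passing all chunks of a passage to the translator at once allows batching on the model side.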
We included audio as a possible input and output because we wanted to make the app accessible to as many people as possible. With this app we want to put biomedical knowledge in Spanish within everyone's reach.
<h3 style="font-family: Georgia, serif;">
System Architecture
</h3>
Below you can find all the pieces that make up the system. This section is deliberately brief so that you can get a broad view of how the app works internally, and then follow the links to each model and dataset, where you will find much more information on each piece.
<img src="https://drive.google.com/uc?export=view&id=1_iUdUMPR5u1p9767YVRbCZkobt_fOozD">
<ol>
<li><a href="https://huggingface.co/IIC/wav2vec2-spanish-multilibrispeech">Speech2Text</a>: For this we fine-tuned a multilingual Wav2Vec2, as explained in the attached link. We use this model to process audio questions. More info: https://huggingface.co/IIC/wav2vec2-spanish-multilibrispeech</li>
<li><a href="https://huggingface.co/IIC/dpr-spanish-passage_encoder-allqa-base">Dense Passage Retrieval (DPR) for Context</a>: Dense Passage Retrieval is a methodology <a href="https://arxiv.org/abs/2004.04906">developed by Facebook</a> which is currently the SoTA for Passage Retrieval, that is, the task of getting the most relevant passages to answer a given question. You can find details about how it was trained here: https://huggingface.co/IIC/dpr-spanish-passage_encoder-allqa-base. </li>
<li><a href="https://huggingface.co/IIC/dpr-spanish-question_encoder-allqa-base">Dense Passage Retrieval (DPR) for Question</a>: This is the question encoder that belongs to the same DPR system as the context encoder above. For more details, go to https://huggingface.co/IIC/dpr-spanish-question_encoder-allqa-base .</li>
<li><a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">Sentence Encoder Ranker</a>: This piece reranks the candidate contexts retrieved by DPR and selects the top 5 passages for the generative model to read; it is the final filter before the generative model. We tried 3 different configurations and human-checked the resulting answers (that's us seriously playing with our toy), as the generated answers depended heavily on this piece of the puzzle. The first option, before we trained our own cross-encoder, was to use a <a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">multilingual sentence transformer</a> trained on multilingual MS Marco. This worked reasonably well, although it was noticeable that it wasn't specialized in Spanish. We then tried our own CrossEncoder, trained on our Spanish translation of MS Marco: https://huggingface.co/datasets/IIC/msmarco_es. It worked better than the sentence transformer. Then, looking at how differently the two models ranked the same passages, it occurred to us that multiplying their similarity scores element by element might give a less biased ranking, in which only the documents that both rankers consider important appear at the top. We tried this and it gave much better results, so we kept both systems followed by the multiplication of their similarity scores.</li>
<li><a href="https://huggingface.co/IIC/mt5-base-lfqa-es">Generative Long-Form Question Answering Model</a>: For this we used either mT5 (the one linked here) or <a href="https://huggingface.co/IIC/mbart-large-lfqa-es">mBART</a>. This generative model receives the most relevant passages and uses them to generate an answer to the question. You can find more details about how we trained them at https://huggingface.co/IIC/mt5-base-lfqa-es and https://huggingface.co/IIC/mbart-large-lfqa-es .</li>
<li><a href="https://huggingface.co/facebook/tts_transformer-es-css10">Text2Speech</a>: For this we used Meta's text2speech service on Huggingface, as text2speech classes are not yet implemented on the main branch of Transformers. This piece was a must to provide a voice to voice service so that it's almost fully accessible. As future work, as soon as text2speech classes are implemented in transformers, we will train our own models to replace this piece.</li>
</ol>
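The score multiplication described for the ranker can be sketched in a few lines. This is a simplified illustration rather than the app's code: the function and argument names are hypothetical, and it assumes both rankers output non-negative scores on comparable scales (for example similarities normalized to the range 0 to 1), since a product of negative scores would distort the ordering.

```python
from typing import List, Sequence


def rerank_by_score_product(
    passages: Sequence[str],
    encoder_scores: Sequence[float],
    cross_scores: Sequence[float],
    top_k: int = 5,
) -> List[str]:
    # Multiply the two rankers' scores element by element, so a passage only
    # ranks high when both rankers consider it relevant.
    if not (len(passages) == len(encoder_scores) == len(cross_scores)):
        raise ValueError("passages and both score lists must have the same length")
    combined = [e * c for e, c in zip(encoder_scores, cross_scores)]
    order = sorted(range(len(passages)), key=lambda i: combined[i], reverse=True)
    return [passages[i] for i in order[:top_k]]


# "a" is the encoder's favourite but the cross-encoder disagrees, so it drops out
print(rerank_by_score_product(["a", "b", "c"], [0.9, 0.5, 0.8], [0.2, 0.9, 0.7], top_k=2))
# ['c', 'b']
```

The product acts like a soft logical AND: a passage with a low score from either ranker is pushed down regardless of how the other ranker scored it.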
Beyond these models, the system could not respond in under a minute on CPU without some indexing tricks on the dataset, for which we use <a href="https://github.com/facebookresearch/faiss">Faiss</a>. We need to search for passages relevant to the question among over 1.5M semi-long documents, so comparing the question vector encoded by DPR against every document vector would take over 1.5M comparisons. Instead, we created a FAISS index optimized for very fast search, configured as follows:
<ul>
<li>A dimensionality reduction method is applied to represent each of the 1.5M documents as a vector of 128 elements, which after quantization requires only 32 bytes of memory per vector.</li>
<li>Document vectors are clustered with k-means into about 5K clusters.</li>
<li>At query time, the query vector follows the same pipeline, and relevant documents from the same cluster are retrieved.</li>
</ul>
Using this strategy we managed to reduce the passage retrieval time to milliseconds. This is key, since large generative language models like the ones we use already take a long time on CPU, so we offset that cost by reducing the retrieval time.
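The clustering idea behind the index can be shown with a toy, dependency-free sketch. It is not FAISS and it leaves out the dimensionality reduction and quantization steps: it only clusters document vectors with a simple k-means, stores the document ids of each cluster, and at query time compares the query against the documents of its nearest cluster instead of all of them. All function names are ours.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids: List[List[float]], v: Vector) -> int:
    # Index of the centroid closest to v
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


def kmeans(vectors: List[Vector], k: int, iterations: int = 10, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    for _ in range(iterations):
        buckets: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(centroids, v)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_inverted_lists(vectors: List[Vector], centroids: List[List[float]]) -> Dict[int, List[int]]:
    # Map each cluster id to the ids of the documents assigned to it
    lists: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for doc_id, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(doc_id)
    return lists


def search(query: Vector, vectors: List[Vector], centroids: List[List[float]],
           lists: Dict[int, List[int]], top_k: int = 3) -> List[int]:
    # Only documents in the query's nearest cluster are compared with the query
    candidates = lists[nearest(centroids, query)]
    return sorted(candidates, key=lambda i: sq_dist(vectors[i], query))[:top_k]
```

The trade-off is that a relevant document assigned to a neighbouring cluster can be missed, which is why this kind of index is an approximate search.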
<h3 style="font-family: Georgia, serif;">
Datasets used and created
</h3>
We uploaded, and in some cases created, datasets in Spanish to be able to build such a system.
<ol>
<li><a href="https://huggingface.co/datasets/IIC/spanish_biomedical_crawled_corpus">Spanish Biomedical Crawled Corpus</a>. Used for finding answers to questions about biomedicine. (More info in https://huggingface.co/datasets/IIC/spanish_biomedical_crawled_corpus .)</li>
<li><a href="https://huggingface.co/datasets/IIC/lfqa_spanish">LFQA_Spanish</a>. Used for training the generative model. (More info in https://huggingface.co/datasets/IIC/lfqa_spanish )</li>
<li><a href="https://huggingface.co/datasets/squad_es">SQUADES</a>. Used to train the DPR models. (More info in https://huggingface.co/datasets/squad_es .)</li>
<li><a href="https://huggingface.co/datasets/IIC/bioasq22_es">BioAsq22-Spanish</a>. Used to train the DPR models. (More info in https://huggingface.co/datasets/IIC/bioasq22_es .)</li>
<li><a href="https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC">SQAC (Spanish Question Answering Corpus)</a>. Used to train the DPR models. (More info in https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC .)</li>
<li><a href="https://huggingface.co/datasets/IIC/msmarco_es">MSMARCO-ES</a>. Used to train CrossEncoder in Spanish for Ranker. (More info in https://huggingface.co/datasets/IIC/msmarco_es .)</li>
<li><a href="https://huggingface.co/datasets/multilingual_librispeech">MultiLibrispeech</a>. Used to train the Speech2Text model in Spanish. (More info in https://huggingface.co/datasets/multilingual_librispeech .)</li>
</ol>
<h3 style="font-family: Georgia, serif;">
<a href="https://www.un.org/sustainabledevelopment/es/objetivos-de-desarrollo-sostenible/">Sustainable Development Goals</a>
</h3>
<ol>
<li><a href="https://www.un.org/sustainabledevelopment/es/health/">Good health and well-being</a>: with our system we aim to improve the search for information about health and the biomedical sector. It helps biomedical researchers explore a large database on the subject, which can speed up research and development in this area, and it helps any individual who wants to learn more about health and related topics. In this way we use AI to promote both knowledge and exploration in the field of biomedicine in Spanish.</li>
<li><a href="https://www.un.org/sustainabledevelopment/es/education/">Quality education</a>: by offering the world an advanced information lookup system, we help complement and improve the quality of current education in the biomedical world, since students have a system for learning about this field by interacting, through our models, with a large knowledge base on the subject.</li>
<li><a href="https://www.un.org/sustainabledevelopment/es/inequality/">Reduced inequalities</a>: by building an end-to-end voice-to-voice system in which using the keyboard would not be necessary (*), we promote the accessibility of the tool. The intention is that people who cannot read or write, or who have difficulty doing so, have the opportunity to interact with BioMedIA. We saw the need to make this system as flexible as possible, so that it would be easy to interact with regardless of people's physical difficulties or limitations. By including voice output, people with vision problems can also receive answers to their questions. This reduces inequalities in access to the tool for people with any of those impairments. In addition, by creating a free tool for accessing knowledge that is available anywhere in the world with Internet access, we reduce inequalities in access to information.</li>
</ol>
<h3 style="font-family: Georgia, serif;">
Contact
</h3>
<ul>
<li>Alejandro Vaca Serrano. <a href="https://www.linkedin.com/in/alejandro-vaca-serrano/">LinkedIn</a> </li>
<li>David Betancur Sánchez. <a href="https://www.linkedin.com/in/david-betancur-s%C3%A1nchez-714a79154/">LinkedIn</a> </li>
<li>Alba Segurado. <a href="https://www.linkedin.com/in/alba-segurado-data-science/">LinkedIn</a> </li>
<li>Álvaro Barbero Jiménez. <a href="https://twitter.com/albarjip">Twitter </a></li>
<li>Guillem García Subies. <a href="https://www.linkedin.com/in/guillemgsubies/">LinkedIn</a> </li>
</ul>
</p>
(*) Note that in the current demo of the system the user still needs a minimal amount of keyboard and mouse interaction. This is due to a design limitation of Hugging Face Spaces. Nevertheless, the technologies developed would allow integration into a purely voice-driven system.
"""
# 1HOzvvgDLFNTK7tYAY1dRzNiLjH41fZks
# 1kvHDFUPPnf1kM5EKlv5Ife2KcZZvva_1
description = """
<meta charset="UTF-8">
<a href="https://www.iic.uam.es/">
<img src="https://drive.google.com/uc?export=view&id=1HOzvvgDLFNTK7tYAY1dRzNiLjH41fZks" style="max-width: 100%; max-height: 10%; height: 250px; object-fit: fill">
</a>
<h1 style="font-family: Georgia, serif;"> BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish </h1>
<p> This app uses an advanced search system to retrieve texts relevant to your question, and then tries to condense all that information into a coherent, self-contained explanation. More details and example questions can be found in the section below.
The generative system generally takes between 20 and 50 seconds, so while you wait for the answers we invite you to dive into the article we have left below the app's examples, where you can discover more details about how it works &#128214; &#129299;.
The team members:
<ul>
<li>Alejandro Vaca Serrano: <a href="https://huggingface.co/avacaondata">@avacaondata</a></li>
<li>David Betancur Sánchez: <a href="https://huggingface.co/Dabs">@Dabs</a></li>
<li>Alba Segurado: <a href="https://huggingface.co/alborotis">@alborotis</a></li>
<li>Álvaro Barbero Jiménez: <a href="https://huggingface.co/albarji">@albarji</a></li>
<li>Guillem García Subies: <a href="https://huggingface.co/GuillemGSubies">@GuillemGSubies</a></li>
</ul>
We hope you enjoy it and have fun exploring it &#128151; </p>
"""
examples = [
[
"¿Cuáles son los efectos secundarios más ampliamente reportados en el tratamiento de la enfermedad de Crohn?",
"vacio.flac",
"vacio.flac",
60,
8,
3,
1.0,
250,
False,
],
[
"¿Para qué sirve la tecnología CRISPR?",
"vacio.flac",
"vacio.flac",
60,
8,
3,
1.0,
250,
False,
],
[
"¿Qué es el lupus?",
"vacio.flac",
"vacio.flac",
60,
8,
3,
1.0,
250,
False,
],
[
"¿Qué es la anorexia?",
"vacio.flac",
"vacio.flac",
60,
8,
3,
1.0,
250,
False,
],
[
"¿Por qué sentimos ansiedad?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False,
],
[
"¿Qué es la gripe aviar?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False,
],
[
"¿Qué es la tecnología CRISPR?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False,
],
[
"¿Cómo se genera la apendicitis?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False,
],
[
"¿Qué es la mesoterapia?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False,
],
[
"¿Qué alternativas al Paracetamol existen para el dolor de cabeza?",
"vacio.flac",
"vacio.flac",
80,
8,
3,
1.0,
250,
False
],
[
"¿Cuáles son los principales tipos de disartria del trastorno del habla motor?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False
],
[
"¿Es la esclerosis tuberosa una enfermedad genética?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False
],
[
"¿Cuál es la función de la proteína Mis18?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False
],
[
"¿Cuáles son las principales causas de muerte?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False
],
[
"¿Qué deficiencia es la causa del síndrome de piernas inquietas?",
"vacio.flac",
"vacio.flac",
50,
8,
3,
1.0,
250,
False
],
[
"¿Cuál es la función del 6SRNA en las bacterias?",
"vacio.flac",
"vacio.flac",
60,
8,
3,
1.0,
250,
False,
],
[
"¿Por qué los humanos desarrollamos diabetes?",
"vacio.flac",
"vacio.flac",
50,
10,
3,
1.0,
250,
False,
],
[
"¿Qué factores de riesgo aumentan la probabilidad de sufrir un ataque al corazón?",
"vacio.flac",
"vacio.flac",
80,
8,
3,
1.0,
250,
False
],
[
"¿Cómo funcionan las vacunas?",
"vacio.flac",
"vacio.flac",
90,
8,
3,
1.0,
250,
False
],
[
"¿Tienen conciencia los animales?",
"vacio.flac",
"vacio.flac",
70,
8,
3,
1.0,
250,
False
],
]