Commit bcb986c by anakin87 · 1 parent: 321ba78
progress in readme
- README.md +19 -0
- app_utils/frontend_utils.py +9 -5
- data/statements.txt +2 -1
- pages/Info.py +3 -0
README.md
CHANGED
@@ -11,3 +11,22 @@ license: apache-2.0
 ---
 
 # Fact Checking rocks! [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/fact-checking-rocks) [![Generic badge](https://img.shields.io/github/stars/anakin87/fact-checking-rocks?label=Github&style=social)](https://github.com/anakin87/fact-checking-rocks)
+
+## *Fact checking baseline combining dense retrieval and textual entailment*
+
+### Idea 💡
+This project aims to show that a simple, naive baseline for fact checking can be built by combining dense retrieval with textual entailment (based on Natural Language Inference models).
+
+### System description
+This project is heavily based on [Haystack](https://github.com/deepset-ai/haystack), an open source NLP framework for building search systems. The main components of our system are an indexing pipeline and a search pipeline.
+
+#### Indexing pipeline
+* [Crawling](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/get_wikipedia_data.ipynb): Crawl data from Wikipedia, starting from the page [List of mainstream rock performers](https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers) and using the [Python wrapper](https://github.com/goldsmith/Wikipedia)
+* [Indexing through Haystack](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/indexing.ipynb)
+  * Preprocess the downloaded documents into chunks consisting of 2 sentences
+  * Chunks with fewer than 10 words are discarded because they are not very informative
+  * Instantiate a [FAISS](https://github.com/facebookresearch/faiss) Document store and store the passages in it
+  * Create embeddings for the passages using a Sentence Transformer model and save them in FAISS. The retrieval task appears to involve [*asymmetric semantic search*](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) (statements to be verified are usually shorter than the relevant passages), so I chose the model `msmarco-distilbert-base-tas-b`.
+  * Save the FAISS index
+
+#### Search pipeline
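The preprocessing step described in the README's indexing pipeline (2-sentence chunks, with chunks under 10 words discarded) can be illustrated with a minimal, standalone sketch. This is not the project's code: the linked notebook does this step through Haystack, while the version below uses a naive regex sentence splitter. The function names `split_sentences` and `chunk_document`, their parameters, and the sample text are all hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_document(text: str, sentences_per_chunk: int = 2, min_words: int = 10) -> List[str]:
    """Group sentences into fixed-size chunks and drop chunks with fewer than min_words words."""
    sentences = split_sentences(text)
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[start:start + sentences_per_chunk])
        # Short chunks carry little information for retrieval, so they are skipped.
        if len(chunk.split()) >= min_words:
            chunks.append(chunk)
    return chunks


# Made-up sample text, for illustration only.
sample = (
    "Alpha is a rock band that formed many years ago in a small town. "
    "The group released several albums and toured widely. "
    "They split up. "
    "Twice."
)
chunks = chunk_document(sample)
for chunk in chunks:
    print(chunk)
```

With this sample, the first two sentences form one chunk that is kept, while the last two form a four-word chunk that is dropped. A real pipeline would likely need a more robust sentence splitter than this regex, which mishandles abbreviations such as "U.S.".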
app_utils/frontend_utils.py
CHANGED
@@ -10,11 +10,15 @@ entailment_html_messages = {
 }
 
 def build_sidebar():
-
-
-
-
-
+    sidebar="""
+    <h1 style='text-align: center'>Fact checking 🎸 Rocks!</h1>
+    <div style='text-align: center'>
+    <i>Fact checking baseline combining dense retrieval and textual entailment</i>
+    <p><br/><a href='https://github.com/anakin87/fact-checking-rocks'>Github project</a> - Based on <a href='https://github.com/deepset-ai/haystack'>Haystack</a></p>
+    <p><small><a href='https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers'>Data crawled from Wikipedia</a></small></p>
+    </div>
+    """
+    st.sidebar.markdown(sidebar, unsafe_allow_html=True)
 
 def set_state_if_absent(key, value):
     if key not in st.session_state:
data/statements.txt
CHANGED
@@ -45,4 +45,5 @@ The Cure made dark songs
 Cannibal Corpse is a pop punk band
 Slipknot wear masks
 Toto have sold many records
-The verve were a British band
+The verve were a British band
+Psychokiller is a hit by Talking Heads
pages/Info.py
CHANGED
@@ -1 +1,4 @@
 import streamlit as st
+from app_utils.frontend_utils import build_sidebar
+
+build_sidebar()