Initial version

Files changed (5)

.gitattributes +3 -0
README.md +66 -0
config.json +24 -0
documents +3 -0
embeddings +3 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+documents filter=lfs diff=lfs merge=lfs -text
+embeddings filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,66 @@

+---
+inference: false
+language: en
+license:
+  - cc0-1.0
+library_name: txtai
+tags:
+- sentence-similarity
+datasets:
+- arxiv_dataset
+---
+# arXiv txtai embeddings index
+This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [arXiv dataset](https://hf.co/datasets/arxiv_dataset) [metadata](https://info.arxiv.org/help/prep.html).
+txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.
+## Example
+This index can be loaded from the Hugging Face Hub with txtai as shown below.
+```python
+from txtai.embeddings import Embeddings
+# Load the index from the HF Hub
+embeddings = Embeddings()
+embeddings.load(provider="huggingface-hub", container="neuml/txtai-arxiv")
+# Run a search
+embeddings.search("txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.")
+```
+## Use Cases
+An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
+The arXiv index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
+Additionally, this model can identify articles to cite in research: searching with a title and abstract as the query returns similar existing articles.
+## Build the index
+The following steps show how to build this index.
+- Install required build dependencies
+```bash
+pip install txtchat datasets
+```
+- Follow these [instructions](https://huggingface.co/datasets/arxiv_dataset/blob/main/arxiv_dataset.py#L67) to download the dataset
+- Build txtai-arxiv index
+```bash
+python -m txtchat.data.arxiv.index \
+       -d <path to directory with file downloaded in previous step> \
+       -o txtai-arxiv
+```
+## More information
+See the following links for more information on the arXiv metadata dataset.
+- [Dataset on Hugging Face](https://huggingface.co/datasets/arxiv_dataset)
+- [Dataset on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
+- [Metadata description](https://info.arxiv.org/help/prep.html)
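
The two use cases described in the README above (RAG context and finding articles to cite) can be sketched as follows. This is a minimal sketch and not part of the commit: `build_query` and `build_prompt` are hypothetical helpers, joining the title and abstract with a newline is an assumed query format, and the sample rows are placeholders shaped like the `id`/`text`/`score` dictionaries txtai returns when content storage is enabled (`"content": true` in config.json).

```python
def build_query(title, abstract):
    """Combine a title and abstract into one search query (assumed format)."""
    return f"{title.strip()}\n{abstract.strip()}"


def build_prompt(question, results, limit=3):
    """Format the top search results as context for an LLM prompt."""
    context = "\n".join(f"- {row['text']}" for row in results[:limit])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder rows shaped like embeddings.search(...) output; not real index data.
sample_results = [
    {"id": "a", "text": "First placeholder abstract.", "score": 0.9},
    {"id": "b", "text": "Second placeholder abstract.", "score": 0.8},
]

prompt = build_prompt("What do these articles cover?", sample_results)
print(prompt)
```

With the index loaded as in the README example, `embeddings.search(build_query(title, abstract), 5)` would supply `results` in place of the placeholder rows.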

config.json ADDED

@@ -0,0 +1,24 @@

+{
+  "format": "json",
+  "path": "thenlper/gte-base",
+  "batch": 8192,
+  "encodebatch": 128,
+  "faiss": {
+    "quantize": true,
+    "sample": 0.05
+  },
+  "content": true,
+  "dimensions": 768,
+  "backend": "faiss",
+  "offset": 2399802,
+  "build": {
+    "create": "2024-01-15T06:00:38Z",
+    "python": "3.8.18",
+    "settings": {
+      "components": "IVF1386,SQ8"
+    },
+    "system": "Linux (x86_64)",
+    "txtai": "6.4.0"
+  },
+  "update": "2024-01-15T06:00:38Z"
+}
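
The values in config.json above fit together arithmetically, which the sketch below checks. It rests on assumptions not stated in the commit: that `offset` equals the number of indexed rows, that the IVF cell count follows a rule of about 4·sqrt(n) over the training sample, and that the index stores one byte per dimension (SQ8), an 8-byte id per vector, and float32 centroids. Treat the size figure as a rough lower bound, not an exact accounting of the file layout.

```python
import math

count = 2399802        # "offset" in config.json, assumed to be the number of indexed rows
sample = 0.05          # "faiss.sample": fraction of rows used for training
dimensions = 768       # "dimensions"

# Assumed rule: about 4 * sqrt(n) cells, where n is the training sample size
train = int(count * sample)
cells = round(4 * math.sqrt(train))
components = f"IVF{cells},SQ8"
print(components)

# Rough size: SQ8 codes + 8-byte ids + float32 centroids (headers ignored)
estimate = count * dimensions + count * 8 + cells * dimensions * 4
actual = 1866521560    # size of the embeddings file recorded in this commit
print(estimate, actual - estimate)
```

Under these assumptions the computed string is `IVF1386,SQ8`, matching `build.settings.components`, and the estimate falls about 17 KB short of the recorded `embeddings` size.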

documents ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4b751621223cf5a6a0331e2f89eaf5e7d622a5d07e1fe19112a0fc275ede3e8
+size 4296163328

embeddings ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62f1464e1e69958255d959e7f03a7e7173592d62462d7783a30b4876c72a17b8
+size 1866521560
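
The `documents` and `embeddings` entries above are Git LFS pointer files: each records the SHA-256 digest and byte size of the real file. A downloaded copy can be checked against its pointer along these lines. This is a minimal sketch, not part of the commit; `parse_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the usage note is an assumption.

```python
import hashlib
import os


def parse_pointer(text):
    """Parse a Git LFS pointer file into its expected sha256 digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash: {algorithm}")
    return {"sha256": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk=1 << 20):
    """Return True if the file at path has the size and digest the pointer records."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in chunks so multi-gigabyte files are never loaded whole
        for block in iter(lambda: handle.read(chunk), b""):
            sha.update(block)
    return sha.hexdigest() == pointer["sha256"]


# Pointer text for the embeddings file, copied from the diff without the leading "+"
pointer = parse_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:62f1464e1e69958255d959e7f03a7e7173592d62462d7783a30b4876c72a17b8\n"
    "size 1866521560\n"
)
print(pointer["size"])
```

After downloading the repository files, `matches_pointer("embeddings", pointer)` would report whether the local copy is intact. The size check runs first so a truncated download is rejected without hashing.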