update
README.md
CHANGED
@@ -1,66 +1,12 @@
-
-
-
-
-
-
-
-
-
-
-
-
-- **Language Model**: Uses `Viet-Mistral/Vistral-7B-Chat`, a language model based on Mistral, with continued pretraining on Vietnamese for better generation performance.
-
-## Installation
-1. Clone the repository:
-```sh
-git clone https://github.com/quoctata2911/RAG-based-ChatBot-System.git
-```
-
-2. Navigate to the project directory:
-```sh
-cd RAG-Based-Chatbot-System
-```
-
-3. Install the required dependencies:
-```sh
-pip install -r requirements.txt
-```
-
-## Usage
-Upload your Word .docx documents into the data folder. Ensure that each document has been chunked using a special chunk marker separator as specified in the config.yaml file.
-
-1. Configure the chunk marker:
-- Open the `config.yaml` file located in the project directory.
-- Locate the parameter defining the chunk marker and adjust it as needed for your document segmentation requirements.
-
-2. Prepare the data:
-```sh
-python prepare_data.py
-```
-3. Run the chatbot:
-```sh
-python chat.py
-```
-
-## Project Structure
-- **prepare_data.py**: Script to preprocess and chunk documents, converting tables into HTML and segmenting them with chunk markers.
-- **chat.py**: Main script to run the chatbot system.
-
-## Models
-- **Embedding Model**: We use the `intfloat/multilingual-e5-small` model for generating embeddings. This model is particularly effective for Vietnamese text, outperforming other models in our benchmarks.
-
-- **Language Model**: The language model used is Vistral, a variant of the Mistral model that has been further pre-trained on Vietnamese text for improved performance in language generation tasks.
-
-## Benchmarking and Performance
-Through extensive benchmarking, the `intfloat/multilingual-e5-small` model has proven to be the best choice for Vietnamese embeddings, offering a balance of efficiency and performance. The Vistral model enhances language generation capabilities, ensuring the chatbot responds accurately and naturally in Vietnamese.
-
-## Contributions
-We welcome contributions to improve the RAG-ChatBot. Please fork the repository and create a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
-
-## License
-This project is licensed under the MIT License. See the LICENSE file for more details.
-
-## Contact
-For any questions or suggestions, please contact me at [email protected]
+---
+title: GPT2 Vietnamese
+emoji: 🚀
+colorFrom: gray
+colorTo: green
+sdk: gradio
+sdk_version: 4.21.0
+app_file: app.py
+pinned: false
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference