Tanmay09516
committed on
Upload 14 files
Browse files
- .gitattributes +3 -0
- README.md +93 -12
- app.py +259 -0
- build_vectorstore.py +32 -0
- chain.py +46 -0
- embeddings.py +25 -0
- requirements.txt +9 -0
- retriever.py +70 -0
- vector_stores/docs_v1/index.faiss +3 -0
- vector_stores/docs_v1/index.pkl +3 -0
- vector_stores/docs_v2/index.faiss +3 -0
- vector_stores/docs_v2/index.pkl +3 -0
- vector_stores/docs_v3/index.faiss +3 -0
- vector_stores/docs_v3/index.pkl +3 -0
- vectorstore.py +118 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vector_stores/docs_v1/index.faiss filter=lfs diff=lfs merge=lfs -text
+vector_stores/docs_v2/index.faiss filter=lfs diff=lfs merge=lfs -text
+vector_stores/docs_v3/index.faiss filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,12 +1,93 @@
**Lang-Chat** is a chatbot application designed to help users understand the [LangChain](https://langchain.com/) library and troubleshoot issues by leveraging scraped documentation and GitHub issues up to **November 22, 2024**. This project started as a personal tool to deepen our understanding of LangChain and to assist with common issues. During development, we discovered [Chat LangChain](https://chat.langchain.com/) (see [GitHub repository](https://github.com/langchain-ai/chat-langchain)), a more comprehensive, freely available implementation.

Despite this, we decided to continue and complete our own version to contribute to the community and offer an alternative solution. **Lang-Chat** serves as a proof of concept and is actively being developed, with more features to be added over time.

![Lang-Chat Screenshot](screenshot.png) <!-- Optional: Add a screenshot of your app -->

## Inspiration and Concept

The idea behind Lang-Chat was to create a personalized assistant that could help us understand the LangChain library and address issues we encountered. By scraping the documentation and GitHub issues up to **November 22, 2024**, we aimed to build a comprehensive knowledge base. Discovering an existing solution like [Chat LangChain](https://chat.langchain.com/) motivated us to continue our project and potentially offer unique features or perspectives.

### Installation

1. **Clone the Repository**

   ```bash
   git clone https://github.com/Tanmaydoesai/lang-chat.git
   cd lang-chat
   ```

2. **Create a Virtual Environment**

   It's recommended to use a virtual environment to manage dependencies.

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Set Up Environment Variables**

   Create a `.env` file in the root directory by copying the `.env.template` file and filling in your API key.

   ```bash
   cp .env.template .env
   ```

   Edit the `.env` file and add your Groq API key:

   ```dotenv
   GROQ_API_KEY="your_groq_api_key_here"
   ```

   You can obtain a free API key from [https://groq.com/](https://groq.com/).

### Optional: Rebuilding the Vector Stores

The following steps are only needed if you wish to scrape the documentation and issues again; otherwise, the prebuilt vector databases are provided.

1. **Prepare the Data**

   - **Documentation Files:** Place your LangChain documentation files in the `docs/` directory. Ensure they are in `.txt` format.
   - **GitHub Issues:** Scrape and format GitHub issues from the LangChain repository up to **November 22, 2024**, into the `formatted_issues/` directory. Ensure they are in `.txt` format.

2. **Build Vector Stores**

   Before running the application, build the vector stores from your documents and issues.

   ```bash
   python build_vectorstore.py
   ```

   This will process the documents and create vector stores in the `vector_stores/` directory.
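For reference, a successful build should leave a layout roughly like the sketch below (based on the stores shipped in this commit; the `formatted_issues` store is only created if that folder exists and contains `.txt` files):

```text
vector_stores/
├── docs_v1/
│   ├── index.faiss
│   └── index.pkl
├── docs_v2/
│   ├── index.faiss
│   └── index.pkl
├── docs_v3/
│   ├── index.faiss
│   └── index.pkl
└── formatted_issues/
    ├── index.faiss
    └── index.pkl
```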
### Running the Application

After the previous steps are completed, start the Gradio interface by running:

```bash
python app.py
```

After running the command, you should see a local URL (e.g., `http://127.0.0.1:7860/`) in your terminal. Open this URL in your web browser to interact with the Lang-Chat chatbot.

### Usage

1. **Ask a Question:** Enter your question about LangChain in the "Your Question" textbox and press "Send" or hit Enter.
2. **View Chat History:** The chat history will display your questions and the assistant's responses.
3. **Explore Sources:** In the "Source Documents" section, select a source document from the dropdown to view the full content that the assistant referenced.

### Contributing

Lang-Chat is an actively developed proof of concept. Contributions are welcome! Please open issues or submit pull requests for improvements, bug fixes, or new features.

Feel free to reach out or open an issue if you have any questions or suggestions!

---
app.py
ADDED
@@ -0,0 +1,259 @@
# app.py

import gradio as gr
from embeddings import init_embeddings
from vectorstore import load_all_vector_stores
from retriever import create_combined_retriever
from chain import init_conversational_chain
from langchain_groq import ChatGroq  # Custom LLM class
from dotenv import load_dotenv
import os
import sys

# Disable parallelism warnings from tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def init_llm():
    """
    Initialize the Language Model (LLM) using the ChatGroq class.
    Loads environment variables from a .env file.
    """
    load_dotenv()
    llm = ChatGroq()
    return llm

def setup():
    """
    Set up the QA chain by initializing embeddings, loading vector stores,
    creating a combined retriever, and initializing the conversational chain.
    """
    embeddings = init_embeddings()

    # Check if vector stores exist
    if not os.path.exists("vector_stores") or not os.listdir("vector_stores"):
        print("Vector stores not found. Please run 'build_vectorstore.py' first.")
        sys.exit(1)

    # Load all vector stores
    vector_stores = load_all_vector_stores(embeddings)

    # Create a combined retriever from all vector stores
    retriever = create_combined_retriever(vector_stores)

    # Initialize the LLM
    llm = init_llm()

    # Initialize the conversational QA chain
    qa_chain = init_conversational_chain(llm, retriever)
    return qa_chain

# Set up the QA chain
qa_chain = setup()

def format_source_doc(doc):
    """
    Format a source document for display.

    Args:
        doc: A document object containing page_content and metadata.

    Returns:
        A dictionary with a preview, full content, and source.
    """
    preview = doc.page_content[:150] + "..."  # Short preview
    source = doc.metadata.get('source', 'Unknown')
    return {
        "preview": preview,
        "full_content": doc.page_content,
        "source": source
    }

def get_chat_history_tuples(history_messages):
    """
    Convert the chat history from a list of message dictionaries to a list of tuples.

    Args:
        history_messages: List of message dictionaries with 'role' and 'content'.

    Returns:
        List of tuples in the form (user_message, assistant_message).
    """
    chat_history_tuples = []
    user_msg = None
    assistant_msg = None
    for msg in history_messages:
        if msg['role'] == 'user':
            if user_msg is not None:
                # Append previous user message without assistant response
                chat_history_tuples.append((user_msg, assistant_msg))
            user_msg = msg['content']
            assistant_msg = None
        elif msg['role'] == 'assistant':
            assistant_msg = msg['content']
            chat_history_tuples.append((user_msg, assistant_msg))
            user_msg = None
            assistant_msg = None
    # Append any remaining user message
    if user_msg is not None:
        chat_history_tuples.append((user_msg, assistant_msg))
    return chat_history_tuples

def chatbot(message, history):
    """
    Handle the chatbot interaction by invoking the QA chain and formatting the response.

    Args:
        message: The user's message.
        history: The chat history.

    Returns:
        A tuple containing the assistant's answer and the list of source documents.
    """
    # Convert history to list of tuples
    if history is None:
        history = []
    chat_history = get_chat_history_tuples(history)

    # Invoke the QA chain with the formatted history
    response = qa_chain.invoke({
        "question": message,
        "chat_history": chat_history
    })

    # Format the response as a message dictionary
    answer = {
        "role": "assistant",
        "content": response["answer"]
    }

    # Format source documents
    source_docs = [format_source_doc(doc) for doc in response["source_documents"]]

    return answer, source_docs

def show_popup(source_doc):
    """
    Show a popup with the full content of the selected source document.

    Args:
        source_doc: The selected source document.

    Returns:
        An update object for the Gradio Textbox component.
    """
    return gr.update(
        value=f"Source: {source_doc['source']}\n\n{source_doc['full_content']}",
        visible=True
    )

# Define the Gradio Blocks interface
with gr.Blocks(css="""
    .source-box { margin: 5px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
    .source-box:hover { background-color: #f5f5f5; cursor: pointer; }
""") as demo:
    gr.Markdown("# Lang-Chat Chatbot")

    with gr.Row():
        with gr.Column(scale=7):
            # Chat history component
            chatbot_component = gr.Chatbot(
                label="Chat History",
                height=500,
                bubble_full_width=False,
                type="messages"
            )

            with gr.Row():
                # Input textbox for user messages
                msg = gr.Textbox(
                    label="Your Question",
                    placeholder="Ask me anything about LangChain...",
                    scale=8
                )
                # Submit button
                submit = gr.Button("Send", scale=1)

        with gr.Column(scale=3):
            gr.Markdown("### Source Documents")
            # Dropdown to select source documents
            source_dropdown = gr.Dropdown(
                label="Select a Source Document",
                interactive=True
            )
            # Textbox to display full content of the selected document
            popup = gr.Textbox(
                label="Document Details",
                interactive=False,
                visible=False,
                lines=10
            )
            # Hidden state to store source data
            source_data_state = gr.State()

    def process_message(message, history):
        """
        Process the user's message, update chat history, and prepare source document options.

        Args:
            message: The user's message.
            history: The current chat history.

        Returns:
            Updated chat history, updated source dropdown options, and updated source data state.
        """
        if history is None:
            history = []
        answer, sources = chatbot(message, history)

        # Append the new user message and assistant response to history
        history.append({"role": "user", "content": message})
        history.append(answer)

        # Prepare options for the dropdown
        source_options = []
        for idx, source in enumerate(sources):
            option_label = f"{idx+1}. {source['source']} - {source['preview'][:30]}..."
            source_options.append(option_label)

        # Store sources in state
        source_data_state = sources

        return history, gr.update(choices=source_options, value=None), source_data_state

    # Define the submit action for both the textbox and the button
    msg.submit(
        process_message,
        [msg, chatbot_component],
        [chatbot_component, source_dropdown, source_data_state]
    )
    submit.click(
        process_message,
        [msg, chatbot_component],
        [chatbot_component, source_dropdown, source_data_state]
    )

    def show_popup(selected_option, source_data_state):
        """
        Display the full content of the selected source document in a popup.

        Args:
            selected_option: The selected option from the dropdown.
            source_data_state: The list of source documents.

        Returns:
            An update object for the popup textbox.
        """
        if selected_option is None:
            return gr.update(visible=False)
        sources = source_data_state
        # Extract index from selected_option
        idx = int(selected_option.split('.')[0]) - 1
        source = sources[idx]
        full_content = f"Source: {source['source']}\n\n{source['full_content']}"
        return gr.update(value=full_content, visible=True)

    # Define the change action for the dropdown
    source_dropdown.change(show_popup, inputs=[source_dropdown, source_data_state], outputs=popup)

# Launch the Gradio interface
demo.launch()
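By default the final call serves the app on localhost only. If it needs to be reachable from another machine, Gradio's standard share option can be used instead; a hypothetical one-line variant of that call:

```python
# sketch: also creates a temporary public share link (standard Gradio option)
demo.launch(share=True)
```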
build_vectorstore.py
ADDED
@@ -0,0 +1,32 @@
# build_vectorstore.py

from embeddings import init_embeddings
from vectorstore import create_vector_stores, create_vector_store_from_folder
import os

def main():
    """
    Main function to build vector stores from specified document paths and folders.
    """
    # Initialize embeddings
    embeddings = init_embeddings()

    # List of document paths to process
    doc_paths = [
        "docs/docs_v1.txt",
        "docs/docs_v2.txt",
        "docs/docs_v3.txt"
    ]

    # Create vector stores for individual documents
    create_vector_stores(doc_paths, embeddings)

    # Create vector store from the 'formatted_issues' folder
    formatted_issues_folder = "formatted_issues"
    if os.path.exists(formatted_issues_folder):
        create_vector_store_from_folder(formatted_issues_folder, embeddings)
    else:
        print(f"Folder {formatted_issues_folder} does not exist. Skipping.")

if __name__ == "__main__":
    main()
chain.py
ADDED
@@ -0,0 +1,46 @@
# chain.py

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

def init_conversational_chain(llm, retriever):
    """
    Initialize the Conversational Retrieval Chain with memory and custom prompt.

    Args:
        llm: The language model to use.
        retriever: The retriever to fetch relevant documents.

    Returns:
        An instance of ConversationalRetrievalChain.
    """
    # Initialize conversation memory
    memory = ConversationBufferMemory(
        return_messages=True,
        memory_key="chat_history",
        output_key="answer"
    )

    # Define a custom prompt template
    custom_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=(
            "You are LangAssist, a knowledgeable assistant for the LangChain Python Library. "
            "Given the following context from the documentation, provide a helpful answer to the user's question.\n\n"
            "Context:\n{context}\n\n"
            "Question: {question}\n\n"
            "Answer:"
        )
    )

    # Initialize the Conversational Retrieval Chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        combine_docs_chain_kwargs={"prompt": custom_prompt},
        verbose=False
    )
    return qa_chain
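For orientation (not part of this file), the chain built here is what `app.py` invokes with a question and the running chat history. A minimal stand-alone sketch, assuming `llm` and `retriever` have already been constructed as in `app.py`:

```python
# sketch: one-off invocation of the conversational retrieval chain
qa_chain = init_conversational_chain(llm, retriever)
result = qa_chain.invoke({"question": "How do I add memory to a chain?", "chat_history": []})
print(result["answer"])
# return_source_documents=True means the retrieved documents come back too
for doc in result["source_documents"]:
    print(doc.metadata.get("source", "Unknown"))
```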
embeddings.py
ADDED
@@ -0,0 +1,25 @@
# embeddings.py

from langchain_huggingface import HuggingFaceEmbeddings
import torch

def init_embeddings():
    """
    Initialize the HuggingFace embeddings model.

    Returns:
        An instance of HuggingFaceEmbeddings.
    """
    model_name = "sentence-transformers/all-mpnet-base-v2"
    model_kwargs = {
        'device': 'cuda' if torch.cuda.is_available() else 'cpu'
    }
    encode_kwargs = {'normalize_embeddings': False}

    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )

    return embeddings
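As a quick sanity check (not part of the repository), the object returned by `init_embeddings` follows LangChain's standard `Embeddings` interface, so a sketch like the following should work once the model has downloaded:

```python
# sketch: confirm the embedding model loads and produces vectors
from embeddings import init_embeddings

emb = init_embeddings()
vec = emb.embed_query("How do I create a retriever in LangChain?")
print(len(vec))  # all-mpnet-base-v2 yields 768-dimensional vectors
```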
requirements.txt
ADDED
@@ -0,0 +1,9 @@
faiss-cpu==1.9.0.post1
gradio==5.6.0
gradio_client==1.4.3
langchain==0.3.8
langchain-community==0.3.8
langchain-core==0.3.21
langchain-groq==0.2.1
langchain-huggingface==0.1.2
langchain-text-splitters==0.3.2
retriever.py
ADDED
@@ -0,0 +1,70 @@
# retriever.py

from langchain.schema import BaseRetriever
from typing import List
from pydantic import BaseModel

class CombinedRetriever(BaseRetriever):
    """
    A retriever that combines multiple retrievers and returns the top K relevant documents.
    """
    retrievers: List[BaseRetriever]
    k: int = 5

    def _get_relevant_documents(self, query: str):
        """
        Retrieve relevant documents by querying all combined retrievers.

        Args:
            query: The search query string.

        Returns:
            A list of relevant documents.
        """
        all_docs = []
        for retriever in self.retrievers:
            # Correctly invoke the retriever with the query string
            docs = retriever.get_relevant_documents(query)
            all_docs.extend(docs)
        # Return the top K documents
        return all_docs[:self.k]

    async def _aget_relevant_documents(self, query: str):
        """
        Asynchronously retrieve relevant documents by querying all combined retrievers.

        Args:
            query: The search query string.

        Returns:
            A list of relevant documents.
        """
        all_docs = []
        for retriever in self.retrievers:
            # Correctly invoke the retriever with the query string
            docs = await retriever.aget_relevant_documents(query)
            all_docs.extend(docs)
        # Return the top K documents
        return all_docs[:self.k]

def create_combined_retriever(vector_stores, search_kwargs={"k": 3}):
    """
    Create a CombinedRetriever from multiple vector stores.

    Args:
        vector_stores: A dictionary of vector stores.
        search_kwargs: Keyword arguments for the retrievers (e.g., number of documents).

    Returns:
        An instance of CombinedRetriever.
    """
    retrievers = [
        vs.as_retriever(search_kwargs=search_kwargs)
        for vs in vector_stores.values()
    ]

    combined_retriever = CombinedRetriever(
        retrievers=retrievers,
        k=search_kwargs.get("k", 3)
    )
    return combined_retriever
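A minimal sketch of exercising this retriever outside the Gradio app, assuming the vector stores have already been built:

```python
# sketch: query the combined retriever directly
from embeddings import init_embeddings
from vectorstore import load_all_vector_stores
from retriever import create_combined_retriever

embeddings = init_embeddings()
stores = load_all_vector_stores(embeddings)
retriever = create_combined_retriever(stores, search_kwargs={"k": 2})

for doc in retriever.get_relevant_documents("How do I stream output from an LLM?"):
    print(doc.metadata.get("source", "Unknown"), "-", doc.page_content[:80])
```

Note that results are simply concatenated in retriever order and truncated to `k`, so there is no cross-store relevance ranking.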
vector_stores/docs_v1/index.faiss
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d5e5631b2446bc2de838d2dae70fa510cd9f8ebda6a28ce5306f7dd4ef29d9d
size 108309549
vector_stores/docs_v1/index.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:abd384e8ea668299122189a21b6e22b7f7e4a1310582763ccddd2c78844fc5b8
size 33772455
vector_stores/docs_v2/index.faiss
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a3f684205f3d92385757a17b5e5a7ff9f6357a80b1ac2c441c912a1da5c1323
size 92697645
vector_stores/docs_v2/index.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:44b5dad15385ae447d401cfb7f69fdf52ca4b352bd414e2b9560816e23b8fea1
size 30836096
vector_stores/docs_v3/index.faiss
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b4e53821ff162bd3b9beaedf6871866e620fb97931e7e90f477b7761cdd21b62
size 49499181
vector_stores/docs_v3/index.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2bb8ff41c4c4a97f26bb0a456a4f3e18518927d2ee4b4623a7aa595f780d8a82
size 15370601
vectorstore.py
ADDED
@@ -0,0 +1,118 @@
# vectorstore.py

import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

def load_and_split_document(file_path, chunk_size=1000, chunk_overlap=150):
    """
    Load a document from a file and split it into chunks.

    Args:
        file_path: Path to the text file.
        chunk_size: The maximum size of each chunk.
        chunk_overlap: The overlap between chunks.

    Returns:
        A list of document chunks.
    """
    loader = TextLoader(
        file_path,
        encoding='utf-8',
        autodetect_encoding=True
    )

    try:
        documents = loader.load()
    except RuntimeError:
        # Fallback to a different encoding if autodetection fails
        loader = TextLoader(
            file_path,
            encoding='latin-1',
            autodetect_encoding=False
        )
        documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )

    chunks = text_splitter.split_documents(documents)
    return chunks

def create_vector_stores(doc_paths, embeddings):
    """
    Create vector stores from a list of document paths.

    Args:
        doc_paths: List of paths to document files.
        embeddings: The embeddings model to use.

    Returns:
        A dictionary of vector stores.
    """
    vector_stores = {}
    os.makedirs("vector_stores", exist_ok=True)

    for doc_path in doc_paths:
        store_name = os.path.basename(doc_path).split('.')[0]
        chunks = load_and_split_document(doc_path)
        print(f"Processing {store_name}: {len(chunks)} chunks created")
        vectorstore = FAISS.from_documents(chunks, embeddings)
        vectorstore.save_local(f"vector_stores/{store_name}")
        vector_stores[store_name] = vectorstore

    return vector_stores

def create_vector_store_from_folder(folder_path, embeddings):
    """
    Create a single vector store from all text files in a folder.

    Args:
        folder_path: Path to the folder containing text files.
        embeddings: The embeddings model to use.

    Returns:
        A dictionary containing the created vector store.
    """
    vector_stores = {}
    os.makedirs("vector_stores", exist_ok=True)
    all_chunks = []
    file_names = []

    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            chunks = load_and_split_document(file_path)
            all_chunks.extend(chunks)
            file_names.append(filename)

    print(f"Processing {folder_path}: {len(all_chunks)} chunks created from {len(file_names)} files")
    vectorstore = FAISS.from_documents(all_chunks, embeddings)
    store_name = os.path.basename(folder_path.rstrip('/'))
    vectorstore.save_local(f"vector_stores/{store_name}")
    vector_stores[store_name] = vectorstore

    return vector_stores

def load_all_vector_stores(embeddings):
    """
    Load all vector stores from the 'vector_stores' directory.

    Args:
        embeddings: The embeddings model to use.

    Returns:
        A dictionary of loaded vector stores.
    """
    vector_stores = {}
    store_dir = "vector_stores"

    for store_name in os.listdir(store_dir):
        store_path = os.path.join(store_dir, store_name)
        if os.path.isdir(store_path):
            vector_stores[store_name] = FAISS.load_local(store_path, embeddings, allow_dangerous_deserialization=True)
    return vector_stores
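For debugging a single store, it can also be loaded directly with the same FAISS API used in `load_all_vector_stores`; a sketch, assuming `vector_stores/docs_v1` exists:

```python
# sketch: load one saved store and run a raw similarity search
from embeddings import init_embeddings
from langchain_community.vectorstores import FAISS

embeddings = init_embeddings()
store = FAISS.load_local("vector_stores/docs_v1", embeddings, allow_dangerous_deserialization=True)
for doc in store.similarity_search("callbacks and tracing", k=2):
    print(doc.page_content[:100])
```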