Tanmay09516 committed (verified)
Commit dd87c4b · 1 Parent(s): 14c52b8

Upload 14 files
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v1/index.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v2/index.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v3/index.faiss filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,93 @@
- ---
- title: Langchat
- emoji: 🐢
- colorFrom: yellow
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.6.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ **Lang-Chat** is a chatbot application designed to help users understand the [LangChain](https://langchain.com/) library and troubleshoot issues by leveraging scraped documentation and GitHub issues up to **November 22, 2024**. This project started as a personal tool to deepen our understanding of LangChain and to help with common issues. During development, we discovered [Chat LangChain](https://chat.langchain.com/) (see the [GitHub repository](https://github.com/langchain-ai/chat-langchain)), a more comprehensive implementation that is freely available.
+
+ Despite this, we decided to continue and complete our own version to contribute to the community and offer an alternative solution. **Lang-Chat** serves as a proof of concept and is actively being developed, with more features to be added over time.
+
+ ![Lang-Chat Screenshot](screenshot.png) <!-- Optional: Add a screenshot of your app -->
+
+ ## Inspiration and Concept
+
+ The idea behind Lang-Chat was to create a personalized assistant that could help us understand the LangChain library and address issues we encountered. By scraping the documentation and GitHub issues up to **November 22, 2024**, we aimed to build a comprehensive knowledge base. Discovering an existing solution like [Chat LangChain](https://chat.langchain.com/) motivated us to continue our project and potentially offer unique features or perspectives.
+
+ ### Installation
+
+ 1. **Clone the Repository**
+
+    ```bash
+    git clone https://github.com/Tanmaydoesai/lang-chat.git
+    cd lang-chat
+    ```
+
+ 2. **Create a Virtual Environment**
+
+    It's recommended to use a virtual environment to manage dependencies.
+
+    ```bash
+    python3 -m venv venv
+    source venv/bin/activate # On Windows: venv\Scripts\activate
+    ```
+
+ 3. **Install Dependencies**
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Set Up Environment Variables**
+
+    Create a `.env` file in the root directory by copying the `.env.template` file and filling in your API key.
+
+    ```bash
+    cp .env.template .env
+    ```
+
+    Edit the `.env` file and add your Groq API key:
+
+    ```dotenv
+    GROQ_API_KEY="your_groq_api_key_here"
+    ```
+
+    You can obtain a free API key from [https://groq.com/](https://groq.com/). A quick way to confirm the key is picked up is sketched below.
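+
+    For example, a minimal, optional sanity check (using `python-dotenv`, the same loading mechanism `app.py` relies on):
+
+    ```python
+    from dotenv import load_dotenv
+    import os
+
+    load_dotenv()  # reads GROQ_API_KEY from .env into the environment
+    assert os.getenv("GROQ_API_KEY"), "GROQ_API_KEY is not set"
+    ```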
+
+ ### Optional: Rebuilding the Vector Stores
+
+ The following steps are only needed if you wish to scrape the documentation and issues again; otherwise, the prebuilt vector stores are provided.
+
+ 1. **Prepare the Data**
+
+    - **Documentation Files:** Place your LangChain documentation files in the `docs/` directory. Ensure they are in `.txt` format.
+    - **GitHub Issues:** Scrape and format GitHub issues from the LangChain repository up to **November 22, 2024**, into the `formatted_issues/` directory. Ensure they are in `.txt` format.
+
+ 2. **Build Vector Stores**
+
+    Before running the application, build the vector stores from your documents and issues.
+
+    ```bash
+    python build_vectorstore.py
+    ```
+
+    This will process the documents and create vector stores in the `vector_stores/` directory. A quick way to verify the result is sketched below.
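+
+    For example, an optional check using this repo's own helpers (assuming the default store names produced by `build_vectorstore.py`):
+
+    ```python
+    from embeddings import init_embeddings
+    from vectorstore import load_all_vector_stores
+
+    # Load every store saved under vector_stores/ and list their names
+    stores = load_all_vector_stores(init_embeddings())
+    print(list(stores))  # e.g. ['docs_v1', 'docs_v2', 'docs_v3', 'formatted_issues']
+    ```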
+
+ ### Running the Application
+
+ Once the previous steps are complete, start the Gradio interface by running:
+
+ ```bash
+ python app.py
+ ```
+
+ After running the command, you should see a local URL (e.g., `http://127.0.0.1:7860/`) in your terminal. Open this URL in your web browser to interact with the Lang-Chat chatbot.
+
+ ### Usage
+
+ 1. **Ask a Question:** Enter your question about LangChain in the "Your Question" textbox and press "Send" or hit Enter.
+ 2. **View Chat History:** The chat history will display your questions and the assistant's responses.
+ 3. **Explore Sources:** In the "Source Documents" section, select a source document from the dropdown to view the full content that the assistant referenced.
+
+ ### Contributing
+
+ Lang-Chat is an actively developed proof of concept. Contributions are welcome! Please open issues or submit pull requests for improvements, bug fixes, or new features.
+
+ Feel free to reach out or open an issue if you have any questions or suggestions!
+
+ ---
app.py ADDED
@@ -0,0 +1,259 @@
+ # app.py
+
+ import gradio as gr
+ from embeddings import init_embeddings
+ from vectorstore import load_all_vector_stores
+ from retriever import create_combined_retriever
+ from chain import init_conversational_chain
+ from langchain_groq import ChatGroq  # Groq-hosted chat model used as the LLM
+ from dotenv import load_dotenv
+ import os
+ import sys
+
+ # Disable parallelism warnings from tokenizers
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+ def init_llm():
+     """
+     Initialize the Language Model (LLM) using the ChatGroq class.
+     Loads environment variables from a .env file.
+     """
+     load_dotenv()
+     llm = ChatGroq()
+     return llm
+
+ def setup():
+     """
+     Set up the QA chain by initializing embeddings, loading vector stores,
+     creating a combined retriever, and initializing the conversational chain.
+     """
+     embeddings = init_embeddings()
+
+     # Check if vector stores exist
+     if not os.path.exists("vector_stores") or not os.listdir("vector_stores"):
+         print("Vector stores not found. Please run 'build_vectorstore.py' first.")
+         sys.exit(1)
+
+     # Load all vector stores
+     vector_stores = load_all_vector_stores(embeddings)
+
+     # Create a combined retriever from all vector stores
+     retriever = create_combined_retriever(vector_stores)
+
+     # Initialize the LLM
+     llm = init_llm()
+
+     # Initialize the conversational QA chain
+     qa_chain = init_conversational_chain(llm, retriever)
+     return qa_chain
+
+ # Set up the QA chain
+ qa_chain = setup()
+
+ def format_source_doc(doc):
+     """
+     Format a source document for display.
+
+     Args:
+         doc: A document object containing page_content and metadata.
+
+     Returns:
+         A dictionary with a preview, full content, and source.
+     """
+     preview = doc.page_content[:150] + "..."  # Short preview
+     source = doc.metadata.get('source', 'Unknown')
+     return {
+         "preview": preview,
+         "full_content": doc.page_content,
+         "source": source
+     }
+
+ def get_chat_history_tuples(history_messages):
+     """
+     Convert the chat history from a list of message dictionaries to a list of tuples.
+
+     Args:
+         history_messages: List of message dictionaries with 'role' and 'content'.
+
+     Returns:
+         List of tuples in the form (user_message, assistant_message).
+     """
+     chat_history_tuples = []
+     user_msg = None
+     assistant_msg = None
+     for msg in history_messages:
+         if msg['role'] == 'user':
+             if user_msg is not None:
+                 # Append previous user message without assistant response
+                 chat_history_tuples.append((user_msg, assistant_msg))
+             user_msg = msg['content']
+             assistant_msg = None
+         elif msg['role'] == 'assistant':
+             assistant_msg = msg['content']
+             chat_history_tuples.append((user_msg, assistant_msg))
+             user_msg = None
+             assistant_msg = None
+     # Append any remaining user message
+     if user_msg is not None:
+         chat_history_tuples.append((user_msg, assistant_msg))
+     return chat_history_tuples
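+ # Illustrative example: [{'role': 'user', 'content': 'Hi'}, {'role': 'assistant', 'content': 'Hello!'}]
+ # becomes [('Hi', 'Hello!')], the (user, assistant) pair format the conversational chain expects.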
+
+ def chatbot(message, history):
+     """
+     Handle the chatbot interaction by invoking the QA chain and formatting the response.
+
+     Args:
+         message: The user's message.
+         history: The chat history.
+
+     Returns:
+         A tuple containing the assistant's answer and the list of source documents.
+     """
+     # Convert history to list of tuples
+     if history is None:
+         history = []
+     chat_history = get_chat_history_tuples(history)
+
+     # Invoke the QA chain with the formatted history
+     response = qa_chain.invoke({
+         "question": message,
+         "chat_history": chat_history
+     })
+
+     # Format the response as a message dictionary
+     answer = {
+         "role": "assistant",
+         "content": response["answer"]
+     }
+
+     # Format source documents
+     source_docs = [format_source_doc(doc) for doc in response["source_documents"]]
+
+     return answer, source_docs
+
+ def show_popup(source_doc):
+     """
+     Show a popup with the full content of the selected source document.
+
+     Args:
+         source_doc: The selected source document.
+
+     Returns:
+         An update object for the Gradio Textbox component.
+     """
+     return gr.update(
+         value=f"Source: {source_doc['source']}\n\n{source_doc['full_content']}",
+         visible=True
+     )
+
+ # Define the Gradio Blocks interface
+ with gr.Blocks(css="""
+ .source-box { margin: 5px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
+ .source-box:hover { background-color: #f5f5f5; cursor: pointer; }
+ """) as demo:
+     gr.Markdown("# Lang-Chat Chatbot")
+
+     with gr.Row():
+         with gr.Column(scale=7):
+             # Chat history component
+             chatbot_component = gr.Chatbot(
+                 label="Chat History",
+                 height=500,
+                 bubble_full_width=False,
+                 type="messages"
+             )
+
+             with gr.Row():
+                 # Input textbox for user messages
+                 msg = gr.Textbox(
+                     label="Your Question",
+                     placeholder="Ask me anything about LangChain...",
+                     scale=8
+                 )
+                 # Submit button
+                 submit = gr.Button("Send", scale=1)
+
+         with gr.Column(scale=3):
+             gr.Markdown("### Source Documents")
+             # Dropdown to select source documents
+             source_dropdown = gr.Dropdown(
+                 label="Select a Source Document",
+                 interactive=True
+             )
+             # Textbox to display full content of the selected document
+             popup = gr.Textbox(
+                 label="Document Details",
+                 interactive=False,
+                 visible=False,
+                 lines=10
+             )
+             # Hidden state to store source data
+             source_data_state = gr.State()
+
+     def process_message(message, history):
+         """
+         Process the user's message, update chat history, and prepare source document options.
+
+         Args:
+             message: The user's message.
+             history: The current chat history.
+
+         Returns:
+             Updated chat history, updated source dropdown options, and updated source data state.
+         """
+         if history is None:
+             history = []
+         answer, sources = chatbot(message, history)
+
+         # Append the new user message and assistant response to history
+         history.append({"role": "user", "content": message})
+         history.append(answer)
+
+         # Prepare options for the dropdown
+         source_options = []
+         for idx, source in enumerate(sources):
+             option_label = f"{idx+1}. {source['source']} - {source['preview'][:30]}..."
+             source_options.append(option_label)
+
+         # Store sources in state
+         source_data_state = sources
+
+         return history, gr.update(choices=source_options, value=None), source_data_state
+
+     # Define the submit action for both the textbox and the button
+     msg.submit(
+         process_message,
+         [msg, chatbot_component],
+         [chatbot_component, source_dropdown, source_data_state]
+     )
+     submit.click(
+         process_message,
+         [msg, chatbot_component],
+         [chatbot_component, source_dropdown, source_data_state]
+     )
+
+     def show_popup(selected_option, source_data_state):
+         """
+         Display the full content of the selected source document in a popup.
+
+         Args:
+             selected_option: The selected option from the dropdown.
+             source_data_state: The list of source documents.
+
+         Returns:
+             An update object for the popup textbox.
+         """
+         if selected_option is None:
+             return gr.update(visible=False)
+         sources = source_data_state
+         # Extract index from selected_option
+         idx = int(selected_option.split('.')[0]) - 1
+         source = sources[idx]
+         full_content = f"Source: {source['source']}\n\n{source['full_content']}"
+         return gr.update(value=full_content, visible=True)
+
+     # Define the change action for the dropdown
+     source_dropdown.change(show_popup, inputs=[source_dropdown, source_data_state], outputs=popup)
+
+ # Launch the Gradio interface
+ demo.launch()
build_vectorstore.py ADDED
@@ -0,0 +1,32 @@
+ # build_vectorstore.py
+
+ from embeddings import init_embeddings
+ from vectorstore import create_vector_stores, create_vector_store_from_folder
+ import os
+
+ def main():
+     """
+     Main function to build vector stores from specified document paths and folders.
+     """
+     # Initialize embeddings
+     embeddings = init_embeddings()
+
+     # List of document paths to process
+     doc_paths = [
+         "docs/docs_v1.txt",
+         "docs/docs_v2.txt",
+         "docs/docs_v3.txt"
+     ]
+
+     # Create vector stores for individual documents
+     create_vector_stores(doc_paths, embeddings)
+
+     # Create vector store from the 'formatted_issues' folder
+     formatted_issues_folder = "formatted_issues"
+     if os.path.exists(formatted_issues_folder):
+         create_vector_store_from_folder(formatted_issues_folder, embeddings)
+     else:
+         print(f"Folder {formatted_issues_folder} does not exist. Skipping.")
+
+ if __name__ == "__main__":
+     main()
chain.py ADDED
@@ -0,0 +1,46 @@
+ # chain.py
+
+ from langchain.chains import ConversationalRetrievalChain
+ from langchain.memory import ConversationBufferMemory
+ from langchain.prompts import PromptTemplate
+
+ def init_conversational_chain(llm, retriever):
+     """
+     Initialize the Conversational Retrieval Chain with memory and custom prompt.
+
+     Args:
+         llm: The language model to use.
+         retriever: The retriever to fetch relevant documents.
+
+     Returns:
+         An instance of ConversationalRetrievalChain.
+     """
+     # Initialize conversation memory
+     memory = ConversationBufferMemory(
+         return_messages=True,
+         memory_key="chat_history",
+         output_key="answer"
+     )
+
+     # Define a custom prompt template
+     custom_prompt = PromptTemplate(
+         input_variables=["context", "question"],
+         template=(
+             "You are LangAssist, a knowledgeable assistant for the LangChain Python Library. "
+             "Given the following context from the documentation, provide a helpful answer to the user's question.\n\n"
+             "Context:\n{context}\n\n"
+             "Question: {question}\n\n"
+             "Answer:"
+         )
+     )
+
+     # Initialize the Conversational Retrieval Chain
+     qa_chain = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         memory=memory,
+         return_source_documents=True,
+         combine_docs_chain_kwargs={"prompt": custom_prompt},
+         verbose=False
+     )
+     return qa_chain
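+ # Illustrative usage (mirrors app.py): the chain takes a question plus prior
+ # (user, assistant) turns and returns the answer along with its source documents, e.g.
+ #   result = qa_chain.invoke({"question": "What is a retriever?", "chat_history": []})
+ #   result["answer"], result["source_documents"]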
embeddings.py ADDED
@@ -0,0 +1,25 @@
+ # embeddings.py
+
+ from langchain_huggingface import HuggingFaceEmbeddings
+ import torch
+
+ def init_embeddings():
+     """
+     Initialize the HuggingFace embeddings model.
+
+     Returns:
+         An instance of HuggingFaceEmbeddings.
+     """
+     model_name = "sentence-transformers/all-mpnet-base-v2"
+     model_kwargs = {
+         'device': 'cuda' if torch.cuda.is_available() else 'cpu'
+     }
+     encode_kwargs = {'normalize_embeddings': False}
+
+     embeddings = HuggingFaceEmbeddings(
+         model_name=model_name,
+         model_kwargs=model_kwargs,
+         encode_kwargs=encode_kwargs
+     )
+
+     return embeddings
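+ # Illustrative check: all-mpnet-base-v2 produces 768-dimensional vectors, so
+ #   len(init_embeddings().embed_query("hello")) == 768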
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ faiss-cpu==1.9.0.post1
+ gradio==5.6.0
+ gradio_client==1.4.3
+ langchain==0.3.8
+ langchain-community==0.3.8
+ langchain-core==0.3.21
+ langchain-groq==0.2.1
+ langchain-huggingface==0.1.2
+ langchain-text-splitters==0.3.2
retriever.py ADDED
@@ -0,0 +1,70 @@
+ # retriever.py
+
+ from langchain.schema import BaseRetriever
+ from typing import List
+ from pydantic import BaseModel
+
+ class CombinedRetriever(BaseRetriever):
+     """
+     A retriever that combines multiple retrievers and returns the top K relevant documents.
+     """
+     retrievers: List[BaseRetriever]
+     k: int = 5
+
+     def _get_relevant_documents(self, query: str):
+         """
+         Retrieve relevant documents by querying all combined retrievers.
+
+         Args:
+             query: The search query string.
+
+         Returns:
+             A list of relevant documents.
+         """
+         all_docs = []
+         for retriever in self.retrievers:
+             # Query this retriever and collect its results
+             docs = retriever.get_relevant_documents(query)
+             all_docs.extend(docs)
+         # Return the top K documents
+         return all_docs[:self.k]
+
+     async def _aget_relevant_documents(self, query: str):
+         """
+         Asynchronously retrieve relevant documents by querying all combined retrievers.
+
+         Args:
+             query: The search query string.
+
+         Returns:
+             A list of relevant documents.
+         """
+         all_docs = []
+         for retriever in self.retrievers:
+             # Query this retriever asynchronously and collect its results
+             docs = await retriever.aget_relevant_documents(query)
+             all_docs.extend(docs)
+         # Return the top K documents
+         return all_docs[:self.k]
+
+ def create_combined_retriever(vector_stores, search_kwargs={"k": 3}):
+     """
+     Create a CombinedRetriever from multiple vector stores.
+
+     Args:
+         vector_stores: A dictionary of vector stores.
+         search_kwargs: Keyword arguments for the retrievers (e.g., number of documents).
+
+     Returns:
+         An instance of CombinedRetriever.
+     """
+     retrievers = [
+         vs.as_retriever(search_kwargs=search_kwargs)
+         for vs in vector_stores.values()
+     ]
+
+     combined_retriever = CombinedRetriever(
+         retrievers=retrievers,
+         k=search_kwargs.get("k", 3)
+     )
+     return combined_retriever
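+ # Illustrative usage (mirrors app.py):
+ #   retriever = create_combined_retriever(load_all_vector_stores(embeddings))
+ #   docs = retriever.get_relevant_documents("How do I use a text splitter?")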
vector_stores/docs_v1/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d5e5631b2446bc2de838d2dae70fa510cd9f8ebda6a28ce5306f7dd4ef29d9d
+ size 108309549
vector_stores/docs_v1/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abd384e8ea668299122189a21b6e22b7f7e4a1310582763ccddd2c78844fc5b8
+ size 33772455
vector_stores/docs_v2/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a3f684205f3d92385757a17b5e5a7ff9f6357a80b1ac2c441c912a1da5c1323
+ size 92697645
vector_stores/docs_v2/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44b5dad15385ae447d401cfb7f69fdf52ca4b352bd414e2b9560816e23b8fea1
+ size 30836096
vector_stores/docs_v3/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b4e53821ff162bd3b9beaedf6871866e620fb97931e7e90f477b7761cdd21b62
+ size 49499181
vector_stores/docs_v3/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2bb8ff41c4c4a97f26bb0a456a4f3e18518927d2ee4b4623a7aa595f780d8a82
+ size 15370601
vectorstore.py ADDED
@@ -0,0 +1,118 @@
+ # vectorstore.py
+
+ import os
+ from langchain_community.document_loaders import TextLoader
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_community.vectorstores import FAISS
+
+ def load_and_split_document(file_path, chunk_size=1000, chunk_overlap=150):
+     """
+     Load a document from a file and split it into chunks.
+
+     Args:
+         file_path: Path to the text file.
+         chunk_size: The maximum size of each chunk.
+         chunk_overlap: The overlap between chunks.
+
+     Returns:
+         A list of document chunks.
+     """
+     loader = TextLoader(
+         file_path,
+         encoding='utf-8',
+         autodetect_encoding=True
+     )
+
+     try:
+         documents = loader.load()
+     except RuntimeError:
+         # Fallback to a different encoding if autodetection fails
+         loader = TextLoader(
+             file_path,
+             encoding='latin-1',
+             autodetect_encoding=False
+         )
+         documents = loader.load()
+
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len
+     )
+
+     chunks = text_splitter.split_documents(documents)
+     return chunks
+
+ def create_vector_stores(doc_paths, embeddings):
+     """
+     Create vector stores from a list of document paths.
+
+     Args:
+         doc_paths: List of paths to document files.
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary of vector stores.
+     """
+     vector_stores = {}
+     os.makedirs("vector_stores", exist_ok=True)
+
+     for doc_path in doc_paths:
+         store_name = os.path.basename(doc_path).split('.')[0]
+         chunks = load_and_split_document(doc_path)
+         print(f"Processing {store_name}: {len(chunks)} chunks created")
+         vectorstore = FAISS.from_documents(chunks, embeddings)
+         vectorstore.save_local(f"vector_stores/{store_name}")
+         vector_stores[store_name] = vectorstore
+
+     return vector_stores
+
+ def create_vector_store_from_folder(folder_path, embeddings):
+     """
+     Create a single vector store from all text files in a folder.
+
+     Args:
+         folder_path: Path to the folder containing text files.
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary containing the created vector store.
+     """
+     vector_stores = {}
+     os.makedirs("vector_stores", exist_ok=True)
+     all_chunks = []
+     file_names = []
+
+     for filename in os.listdir(folder_path):
+         if filename.endswith(".txt"):
+             file_path = os.path.join(folder_path, filename)
+             chunks = load_and_split_document(file_path)
+             all_chunks.extend(chunks)
+             file_names.append(filename)
+
+     print(f"Processing {folder_path}: {len(all_chunks)} chunks created from {len(file_names)} files")
+     vectorstore = FAISS.from_documents(all_chunks, embeddings)
+     store_name = os.path.basename(folder_path.rstrip('/'))
+     vectorstore.save_local(f"vector_stores/{store_name}")
+     vector_stores[store_name] = vectorstore
+
+     return vector_stores
+
+ def load_all_vector_stores(embeddings):
+     """
+     Load all vector stores from the 'vector_stores' directory.
+
+     Args:
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary of loaded vector stores.
+     """
+     vector_stores = {}
+     store_dir = "vector_stores"
+
+     for store_name in os.listdir(store_dir):
+         store_path = os.path.join(store_dir, store_name)
+         if os.path.isdir(store_path):
+             vector_stores[store_name] = FAISS.load_local(store_path, embeddings, allow_dangerous_deserialization=True)
+     return vector_stores