rchrdgwr committed
Commit 05479f1
1 Parent(s): aa9dd32

FEAT: get ready for qdrant

Files changed (3):
  1. BuildingAChainlitApp.md +62 -1
  2. app.py +15 -4
  3. requirements.txt +2 -1
BuildingAChainlitApp.md CHANGED
@@ -129,10 +129,13 @@ def process_text_file(file: AskFileResponse):

Simply put, this downloads the file as a temp file, loads it in with `TextFileLoader`, splits it with our `TextSplitter`, and returns that list of strings!

- #### QUESTION #1:
+ <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">
+ QUESTION #1:

Why do we want to support streaming? What about streaming is important, or useful?

+ ### ANSWER #1:
+
Streaming is the continuous transmission of data from the model to the UI. Instead of waiting and batching up the response into a single
large message, the response is sent in pieces (streams) as it is created.

@@ -144,6 +147,8 @@ The advantages of streaming:
- essential for real-time processing
- humans can only read so fast, so it's an advantage to get some of the data earlier

+ </div>
+
## On Chat Start:

The next scope is where "the magic happens". On Chat Start is when a user begins a chat session. This will happen whenever a user opens a new chat window, or refreshes an existing chat window.
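To make the streaming answer above concrete, here is a minimal sketch of token-by-token streaming with Chainlit's `cl.Message.stream_token` API; the `fake_llm_stream` generator is a stand-in for a real model's streaming response:

```python
import chainlit as cl

async def fake_llm_stream():
    # stand-in for a streaming LLM response: pieces arrive as they are generated
    for token in ["Streaming ", "sends ", "the ", "answer ", "in ", "pieces."]:
        yield token

@cl.on_message
async def main(message: cl.Message):
    msg = cl.Message(content="")
    async for token in fake_llm_stream():
        await msg.stream_token(token)  # each piece shows up in the UI immediately
    await msg.send()  # finalize the streamed message
```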
 
@@ -182,10 +187,13 @@ Now, we'll save that into our user session!

> NOTE: Chainlit has some great documentation about [User Session](https://docs.chainlit.io/concepts/user-session).

+ <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">
+
### QUESTION #2:

Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?

+ ### ANSWER #2:
The application will hopefully be run by many people at the same time. If the data were stored in a global variable,
it would be accessed by everyone using the application. So every time someone started a new session, the information
would be overwritten, meaning everyone would basically get the same results, unless only one person used the system
 
@@ -193,6 +201,7 @@ at a time.

So the goal is to keep each user's session information separate from all the other users'. The Chainlit User Session
provides the capability of storing each user's data separately.
+ </div>

## On Message

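The session pattern described above boils down to a pair of calls; a minimal sketch (the "settings" key and its contents are illustrative):

```python
import chainlit as cl

@cl.on_chat_start
async def start():
    # stored per connection: every user gets their own copy of this value
    cl.user_session.set("settings", {"model": "gpt-3.5-turbo"})

@cl.on_message
async def main(message: cl.Message):
    settings = cl.user_session.get("settings")  # reads this user's value only
    await cl.Message(content=f"Model in use: {settings['model']}").send()
```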
 
 
@@ -227,7 +236,59 @@ If you're still looking for a challenge, or didn't make any modifications to you

> NOTE: The motivation for these challenges is simple - the beginning of the course is extremely information dense, and people come from all kinds of different technical backgrounds. In order to ensure that all learners are able to engage with the content confidently and comfortably, we want to focus on the basic units of technical competency required. This leads to a situation where some learners, who came in with more robust technical skills, find the introductory material to be too simple - and these open-ended challenges help us address that!

+ ## Support pdf documents
+
+ Code was modified to support pdf documents in the following areas:
+
+ 1) Change to the request for documents in on_chat_start:
+
+ - changed the message to ask for a .txt or .pdf file
+ - changed the acceptable file formats so that pdf documents are included in the select pop-up
+
+ ```python
+ while not files:
+     files = await cl.AskFileMessage(
+         content="Please upload a .txt or .pdf file to begin processing!",
+         accept=["text/plain", "application/pdf"],
+         max_size_mb=2,
+         timeout=180,
+     ).send()
+ ```
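Note that `accept` is expressed as MIME types (`text/plain`, `application/pdf`) rather than bare file extensions - that is what makes PDFs selectable in the upload pop-up.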
+
+ 2) Change the process_text_file() function to handle .pdf files:
+
+ - identify the file extension
+ - read the uploaded document into a temporary file
+ - process a .txt file as before, resulting in the texts list
+ - if the file is a .pdf, use the PyMuPDF library to read each page and add its extracted text to the documents list, which is then split into texts
+
+ ```python
+ file_extension = os.path.splitext(file.name)[1].lower()
+
+ with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
+     temp_file_path = temp_file.name
+     temp_file.write(file.content)
+
+ if file_extension == ".txt":
+     text_loader = TextFileLoader(temp_file_path)
+     documents = text_loader.load_documents()
+     texts = text_splitter.split_texts(documents)
+ elif file_extension == ".pdf":
+     pdf_document = fitz.open(temp_file_path)
+     documents = []
+     for page_num in range(len(pdf_document)):
+         page = pdf_document.load_page(page_num)
+         text = page.get_text()
+         documents.append(text)
+     pdf_document.close()
+     texts = text_splitter.split_texts(documents)
+ else:
+     raise ValueError("Unsupported file type")
+ ```
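With this approach each PDF page becomes one entry in `documents` before splitting, so (assuming the splitter treats each list entry independently) no chunk will span a page boundary, though long pages are still chunked.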
+
+ 3) Test the handling of .pdf and .txt files
+
+ Several different .pdf and .txt files were successfully uploaded and processed by the app.
app.py CHANGED
@@ -110,11 +110,22 @@ async def on_chat_start():

print(f"Processing {len(texts)} text chunks")

+ # decide whether to use the dict vector store or the Qdrant vector store
+
+ use_qdrant = False
+
# Create a dict vector store
- vector_db = VectorDatabase()
- vector_db = await vector_db.abuild_from_list(texts)
-
- chat_openai = ChatOpenAI()
+ if use_qdrant:
+     msg = cl.Message(
+         content="Sorry, Qdrant not implemented yet", disable_human_feedback=True
+     )
+     await msg.send()
+     raise NotImplementedError()
+ else:
+     vector_db = VectorDatabase()
+     vector_db = await vector_db.abuild_from_list(texts)
+
+     chat_openai = ChatOpenAI()

# Create a chain
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
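The Qdrant branch is only stubbed out in this commit, but with `qdrant-client` now in the requirements, a minimal sketch of what it might grow into could look like this - the collection name, vector size, and the `embed_texts` helper are all illustrative assumptions, not part of the commit:

```python
import random

from qdrant_client import QdrantClient, models

def embed_texts(texts):
    # stand-in embedder: the real app would call an embedding model here
    return [[random.random() for _ in range(1536)] for _ in texts]

texts = ["chunk one of the document", "chunk two of the document"]

client = QdrantClient(":memory:")  # in-process mode, no server required
client.recreate_collection(
    collection_name="documents",  # illustrative name
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)
client.upsert(
    collection_name="documents",
    points=[
        models.PointStruct(id=i, vector=vec, payload={"text": txt})
        for i, (vec, txt) in enumerate(zip(embed_texts(texts), texts))
    ],
)

# retrieval: nearest neighbours to the query embedding
hits = client.search(
    collection_name="documents",
    query_vector=embed_texts(["What is this about?"])[0],
    limit=3,
)
```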
requirements.txt CHANGED
@@ -1,4 +1,5 @@
numpy
chainlit==0.7.700
openai
- pymupdf
+ pymupdf
+ qdrant-client