rchrdgwr committed
Commit 05479f1
1 Parent(s): aa9dd32

FEAT: get ready for qdrant

Files changed (3):
  1. BuildingAChainlitApp.md +62 -1
  2. app.py +15 -4
  3. requirements.txt +2 -1
BuildingAChainlitApp.md CHANGED
@@ -129,10 +129,13 @@ def process_text_file(file: AskFileResponse):

Simply put, this downloads the file as a temp file, loads it in with `TextFileLoader`, splits it with our `TextSplitter`, and returns that list of strings!

- #### QUESTION #1:
+ <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">
+ QUESTION #1:

Why do we want to support streaming? What about streaming is important, or useful?

+ ### ANSWER #1:
+
Streaming is the continuous transmission of data from the model to the UI. Instead of waiting and batching up the response into a single
large message, the response is sent in pieces (streams) as it is created.

@@ -144,6 +147,8 @@ The advantages of streaming:
- essential for real-time processing
- humans can only read so fast, so it's an advantage to get some of the data earlier

+ </div>
+
## On Chat Start:

The next scope is where "the magic happens". On Chat Start is when a user begins a chat session. This will happen whenever a user opens a new chat window, or refreshes an existing chat window.
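To make the streaming answer above concrete, here is a minimal sketch of token-by-token streaming with Chainlit's `cl.Message.stream_token` API; the `fake_llm_stream` generator is a stand-in for a real model's streaming response:

```python
import chainlit as cl

async def fake_llm_stream():
    # stand-in for a streaming LLM response: pieces arrive as they are generated
    for token in ["Streaming ", "sends ", "the ", "answer ", "in ", "pieces."]:
        yield token

@cl.on_message
async def main(message: cl.Message):
    msg = cl.Message(content="")
    async for token in fake_llm_stream():
        await msg.stream_token(token)  # each piece shows up in the UI immediately
    await msg.send()  # finalize the streamed message
```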
 
@@ -182,10 +187,13 @@ Now, we'll save that into our user session!

> NOTE: Chainlit has some great documentation about [User Session](https://docs.chainlit.io/concepts/user-session).

+ <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">
+
### QUESTION #2:

Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?

+ ### ANSWER #2:
The application will hopefully be run by many people at the same time. If the data were stored in a global variable,
it would be accessed by everyone using the application. So every time someone started a new session, the information
would be overwritten, meaning everyone would basically get the same results, unless only one person used the system
 
@@ -193,6 +201,7 @@ at a time.

So the goal is to keep each user's session information separate from all the other users'. The Chainlit User Session
provides the capability of storing each user's data separately.
+ </div>

## On Message

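The session pattern described above boils down to a pair of calls; a minimal sketch (the "settings" key and its contents are illustrative):

```python
import chainlit as cl

@cl.on_chat_start
async def start():
    # stored per connection: every user gets their own copy of this value
    cl.user_session.set("settings", {"model": "gpt-3.5-turbo"})

@cl.on_message
async def main(message: cl.Message):
    settings = cl.user_session.get("settings")  # reads this user's value only
    await cl.Message(content=f"Model in use: {settings['model']}").send()
```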
 
 
@@ -227,7 +236,59 @@ If you're still looking for a challenge, or didn't make any modifications to you

> NOTE: The motivation for these challenges is simple - the beginning of the course is extremely information dense, and people come from all kinds of different technical backgrounds. In order to ensure that all learners are able to engage with the content confidently and comfortably, we want to focus on the basic units of technical competency required. This leads to a situation where some learners, who came in with more robust technical skills, find the introductory material to be too simple - and these open-ended challenges help us address that!

+ ## Support pdf documents
+
+ Code was modified to support pdf documents in the following areas:
+
+ 1) Change to the request for documents in on_chat_start:
+
+ - changed the message to ask for a .txt or .pdf file
+ - changed the acceptable file formats so that pdf documents are included in the select pop-up
+
+ ```python
+ while not files:
+     files = await cl.AskFileMessage(
+         content="Please upload a .txt or .pdf file to begin processing!",
+         accept=["text/plain", "application/pdf"],
+         max_size_mb=2,
+         timeout=180,
+     ).send()
+ ```
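Note that `accept` is expressed as MIME types (`text/plain`, `application/pdf`) rather than bare file extensions - that is what makes PDFs selectable in the upload pop-up.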
+
+ 2) Change the process_text_file() function to handle .pdf files:
+
+ - identify the file extension
+ - read the uploaded document into a temporary file
+ - process a .txt file as before, resulting in the texts list
+ - if the file is a .pdf, use the PyMuPDF library to read each page and add its extracted text to the documents list, which is then split into texts
+
+ ```python
+ file_extension = os.path.splitext(file.name)[1].lower()
+
+ with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
+     temp_file_path = temp_file.name
+     temp_file.write(file.content)
+
+ if file_extension == ".txt":
+     text_loader = TextFileLoader(temp_file_path)
+     documents = text_loader.load_documents()
+     texts = text_splitter.split_texts(documents)
+ elif file_extension == ".pdf":
+     pdf_document = fitz.open(temp_file_path)
+     documents = []
+     for page_num in range(len(pdf_document)):
+         page = pdf_document.load_page(page_num)
+         text = page.get_text()
+         documents.append(text)
+     pdf_document.close()
+     texts = text_splitter.split_texts(documents)
+ else:
+     raise ValueError("Unsupported file type")
+ ```
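With this approach each PDF page becomes one entry in `documents` before splitting, so (assuming the splitter treats each list entry independently) no chunk will span a page boundary, though long pages are still chunked.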
+
+ 3) Test the handling of .pdf and .txt files
+
+ Several different .pdf and .txt files were successfully uploaded and processed by the app.
app.py CHANGED
@@ -110,11 +110,22 @@ async def on_chat_start():

print(f"Processing {len(texts)} text chunks")

+ # decide whether to use the dict vector store or the Qdrant vector store
+
+ use_qdrant = False
+
# Create a dict vector store
- vector_db = VectorDatabase()
- vector_db = await vector_db.abuild_from_list(texts)
-
- chat_openai = ChatOpenAI()
+ if use_qdrant:
+     msg = cl.Message(
+         content="Sorry, Qdrant not implemented yet", disable_human_feedback=True
+     )
+     await msg.send()
+     raise NotImplementedError()
+ else:
+     vector_db = VectorDatabase()
+     vector_db = await vector_db.abuild_from_list(texts)
+
+     chat_openai = ChatOpenAI()

# Create a chain
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
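The Qdrant branch is only stubbed out in this commit, but with `qdrant-client` now in the requirements, a minimal sketch of what it might grow into could look like this - the collection name, vector size, and the `embed_texts` helper are all illustrative assumptions, not part of the commit:

```python
import random

from qdrant_client import QdrantClient, models

def embed_texts(texts):
    # stand-in embedder: the real app would call an embedding model here
    return [[random.random() for _ in range(1536)] for _ in texts]

texts = ["chunk one of the document", "chunk two of the document"]

client = QdrantClient(":memory:")  # in-process mode, no server required
client.recreate_collection(
    collection_name="documents",  # illustrative name
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)
client.upsert(
    collection_name="documents",
    points=[
        models.PointStruct(id=i, vector=vec, payload={"text": txt})
        for i, (vec, txt) in enumerate(zip(embed_texts(texts), texts))
    ],
)

# retrieval: nearest neighbours to the query embedding
hits = client.search(
    collection_name="documents",
    query_vector=embed_texts(["What is this about?"])[0],
    limit=3,
)
```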
requirements.txt CHANGED
@@ -1,4 +1,5 @@
numpy
chainlit==0.7.700
openai
- pymupdf
+ pymupdf
+ qdrant-client