Tanmay09516 committed (verified)
Commit dd87c4b · 1 Parent(s): 14c52b8

Upload 14 files
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v1/index.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v2/index.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_stores/docs_v3/index.faiss filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,93 @@
- ---
- title: Langchat
- emoji: 🐢
- colorFrom: yellow
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.6.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ **Lang-Chat** is a chatbot application designed to help users understand the [LangChain](https://langchain.com/) library and troubleshoot issues by leveraging scraped documentation and GitHub issues up to **November 22, 2024**. This project started as a personal tool to deepen our understanding of LangChain and to help with common issues. During development, we discovered [Chat LangChain](https://chat.langchain.com/) (see the [GitHub repository](https://github.com/langchain-ai/chat-langchain)), a more comprehensive implementation that is freely available.
+
+ Despite this, we decided to continue and complete our own version to contribute to the community and offer an alternative solution. **Lang-Chat** serves as a proof of concept and is actively being developed, with more features to be added over time.
+
+ ![Lang-Chat Screenshot](screenshot.png) <!-- Optional: Add a screenshot of your app -->
+
+ ## Inspiration and Concept
+
+ The idea behind Lang-Chat was to create a personalized assistant that could help us understand the LangChain library and address issues we encountered. By scraping the documentation and GitHub issues up to **November 22, 2024**, we aimed to build a comprehensive knowledge base. Discovering an existing solution like [Chat LangChain](https://chat.langchain.com/) motivated us to continue our project and potentially offer unique features or perspectives.
+
+ ### Installation
+
+ 1. **Clone the Repository**
+
+    ```bash
+    git clone https://github.com/Tanmaydoesai/lang-chat.git
+    cd lang-chat
+    ```
+
+ 2. **Create a Virtual Environment**
+
+    It's recommended to use a virtual environment to manage dependencies.
+
+    ```bash
+    python3 -m venv venv
+    source venv/bin/activate # On Windows: venv\Scripts\activate
+    ```
+
+ 3. **Install Dependencies**
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Set Up Environment Variables**
+
+    Create a `.env` file in the root directory by copying the `.env.template` file and filling in your API key.
+
+    ```bash
+    cp .env.template .env
+    ```
+
+    Edit the `.env` file and add your Groq API key:
+
+    ```dotenv
+    GROQ_API_KEY="your_groq_api_key_here"
+    ```
+
+    You can obtain a free API key from [https://groq.com/](https://groq.com/). A quick way to confirm the key is picked up is sketched below.
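+
+    For example, a minimal, optional sanity check (using `python-dotenv`, the same loading mechanism `app.py` relies on):
+
+    ```python
+    from dotenv import load_dotenv
+    import os
+
+    load_dotenv()  # reads GROQ_API_KEY from .env into the environment
+    assert os.getenv("GROQ_API_KEY"), "GROQ_API_KEY is not set"
+    ```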
+
+ ### Optional: Rebuilding the Vector Stores
+
+ The following steps are only needed if you wish to scrape the documentation and issues again; otherwise, the prebuilt vector stores are provided.
+
+ 1. **Prepare the Data**
+
+    - **Documentation Files:** Place your LangChain documentation files in the `docs/` directory. Ensure they are in `.txt` format.
+    - **GitHub Issues:** Scrape and format GitHub issues from the LangChain repository up to **November 22, 2024**, into the `formatted_issues/` directory. Ensure they are in `.txt` format.
+
+ 2. **Build Vector Stores**
+
+    Before running the application, build the vector stores from your documents and issues.
+
+    ```bash
+    python build_vectorstore.py
+    ```
+
+    This will process the documents and create vector stores in the `vector_stores/` directory. A quick way to verify the result is sketched below.
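+
+    For example, an optional check using this repo's own helpers (assuming the default store names produced by `build_vectorstore.py`):
+
+    ```python
+    from embeddings import init_embeddings
+    from vectorstore import load_all_vector_stores
+
+    # Load every store saved under vector_stores/ and list their names
+    stores = load_all_vector_stores(init_embeddings())
+    print(list(stores))  # e.g. ['docs_v1', 'docs_v2', 'docs_v3', 'formatted_issues']
+    ```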
+
+ ### Running the Application
+
+ Once the previous steps are complete, start the Gradio interface by running:
+
+ ```bash
+ python app.py
+ ```
+
+ After running the command, you should see a local URL (e.g., `http://127.0.0.1:7860/`) in your terminal. Open this URL in your web browser to interact with the Lang-Chat chatbot.
+
+ ### Usage
+
+ 1. **Ask a Question:** Enter your question about LangChain in the "Your Question" textbox and press "Send" or hit Enter.
+ 2. **View Chat History:** The chat history will display your questions and the assistant's responses.
+ 3. **Explore Sources:** In the "Source Documents" section, select a source document from the dropdown to view the full content that the assistant referenced.
+
+ ### Contributing
+
+ Lang-Chat is an actively developed proof of concept. Contributions are welcome! Please open issues or submit pull requests for improvements, bug fixes, or new features.
+
+ Feel free to reach out or open an issue if you have any questions or suggestions!
+
+ ---
app.py ADDED
@@ -0,0 +1,259 @@
+ # app.py
+
+ import gradio as gr
+ from embeddings import init_embeddings
+ from vectorstore import load_all_vector_stores
+ from retriever import create_combined_retriever
+ from chain import init_conversational_chain
+ from langchain_groq import ChatGroq  # Groq-hosted chat model used as the LLM
+ from dotenv import load_dotenv
+ import os
+ import sys
+
+ # Disable parallelism warnings from tokenizers
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+ def init_llm():
+     """
+     Initialize the Language Model (LLM) using the ChatGroq class.
+     Loads environment variables from a .env file.
+     """
+     load_dotenv()
+     llm = ChatGroq()
+     return llm
+
+ def setup():
+     """
+     Set up the QA chain by initializing embeddings, loading vector stores,
+     creating a combined retriever, and initializing the conversational chain.
+     """
+     embeddings = init_embeddings()
+
+     # Check if vector stores exist
+     if not os.path.exists("vector_stores") or not os.listdir("vector_stores"):
+         print("Vector stores not found. Please run 'build_vectorstore.py' first.")
+         sys.exit(1)
+
+     # Load all vector stores
+     vector_stores = load_all_vector_stores(embeddings)
+
+     # Create a combined retriever from all vector stores
+     retriever = create_combined_retriever(vector_stores)
+
+     # Initialize the LLM
+     llm = init_llm()
+
+     # Initialize the conversational QA chain
+     qa_chain = init_conversational_chain(llm, retriever)
+     return qa_chain
+
+ # Set up the QA chain
+ qa_chain = setup()
+
+ def format_source_doc(doc):
+     """
+     Format a source document for display.
+
+     Args:
+         doc: A document object containing page_content and metadata.
+
+     Returns:
+         A dictionary with a preview, full content, and source.
+     """
+     preview = doc.page_content[:150] + "..."  # Short preview
+     source = doc.metadata.get('source', 'Unknown')
+     return {
+         "preview": preview,
+         "full_content": doc.page_content,
+         "source": source
+     }
+
+ def get_chat_history_tuples(history_messages):
+     """
+     Convert the chat history from a list of message dictionaries to a list of tuples.
+
+     Args:
+         history_messages: List of message dictionaries with 'role' and 'content'.
+
+     Returns:
+         List of tuples in the form (user_message, assistant_message).
+     """
+     chat_history_tuples = []
+     user_msg = None
+     assistant_msg = None
+     for msg in history_messages:
+         if msg['role'] == 'user':
+             if user_msg is not None:
+                 # Append previous user message without assistant response
+                 chat_history_tuples.append((user_msg, assistant_msg))
+             user_msg = msg['content']
+             assistant_msg = None
+         elif msg['role'] == 'assistant':
+             assistant_msg = msg['content']
+             chat_history_tuples.append((user_msg, assistant_msg))
+             user_msg = None
+             assistant_msg = None
+     # Append any remaining user message
+     if user_msg is not None:
+         chat_history_tuples.append((user_msg, assistant_msg))
+     return chat_history_tuples
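+ # Illustrative example: [{'role': 'user', 'content': 'Hi'}, {'role': 'assistant', 'content': 'Hello!'}]
+ # becomes [('Hi', 'Hello!')], the (user, assistant) pair format the conversational chain expects.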
+
+ def chatbot(message, history):
+     """
+     Handle the chatbot interaction by invoking the QA chain and formatting the response.
+
+     Args:
+         message: The user's message.
+         history: The chat history.
+
+     Returns:
+         A tuple containing the assistant's answer and the list of source documents.
+     """
+     # Convert history to list of tuples
+     if history is None:
+         history = []
+     chat_history = get_chat_history_tuples(history)
+
+     # Invoke the QA chain with the formatted history
+     response = qa_chain.invoke({
+         "question": message,
+         "chat_history": chat_history
+     })
+
+     # Format the response as a message dictionary
+     answer = {
+         "role": "assistant",
+         "content": response["answer"]
+     }
+
+     # Format source documents
+     source_docs = [format_source_doc(doc) for doc in response["source_documents"]]
+
+     return answer, source_docs
+
+ def show_popup(source_doc):
+     """
+     Show a popup with the full content of the selected source document.
+
+     Args:
+         source_doc: The selected source document.
+
+     Returns:
+         An update object for the Gradio Textbox component.
+     """
+     return gr.update(
+         value=f"Source: {source_doc['source']}\n\n{source_doc['full_content']}",
+         visible=True
+     )
+
+ # Define the Gradio Blocks interface
+ with gr.Blocks(css="""
+ .source-box { margin: 5px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
+ .source-box:hover { background-color: #f5f5f5; cursor: pointer; }
+ """) as demo:
+     gr.Markdown("# Lang-Chat Chatbot")
+
+     with gr.Row():
+         with gr.Column(scale=7):
+             # Chat history component
+             chatbot_component = gr.Chatbot(
+                 label="Chat History",
+                 height=500,
+                 bubble_full_width=False,
+                 type="messages"
+             )
+
+             with gr.Row():
+                 # Input textbox for user messages
+                 msg = gr.Textbox(
+                     label="Your Question",
+                     placeholder="Ask me anything about LangChain...",
+                     scale=8
+                 )
+                 # Submit button
+                 submit = gr.Button("Send", scale=1)
+
+         with gr.Column(scale=3):
+             gr.Markdown("### Source Documents")
+             # Dropdown to select source documents
+             source_dropdown = gr.Dropdown(
+                 label="Select a Source Document",
+                 interactive=True
+             )
+             # Textbox to display full content of the selected document
+             popup = gr.Textbox(
+                 label="Document Details",
+                 interactive=False,
+                 visible=False,
+                 lines=10
+             )
+             # Hidden state to store source data
+             source_data_state = gr.State()
+
+     def process_message(message, history):
+         """
+         Process the user's message, update chat history, and prepare source document options.
+
+         Args:
+             message: The user's message.
+             history: The current chat history.
+
+         Returns:
+             Updated chat history, updated source dropdown options, and updated source data state.
+         """
+         if history is None:
+             history = []
+         answer, sources = chatbot(message, history)
+
+         # Append the new user message and assistant response to history
+         history.append({"role": "user", "content": message})
+         history.append(answer)
+
+         # Prepare options for the dropdown
+         source_options = []
+         for idx, source in enumerate(sources):
+             option_label = f"{idx+1}. {source['source']} - {source['preview'][:30]}..."
+             source_options.append(option_label)
+
+         # Store sources in state
+         source_data_state = sources
+
+         return history, gr.update(choices=source_options, value=None), source_data_state
+
+     # Define the submit action for both the textbox and the button
+     msg.submit(
+         process_message,
+         [msg, chatbot_component],
+         [chatbot_component, source_dropdown, source_data_state]
+     )
+     submit.click(
+         process_message,
+         [msg, chatbot_component],
+         [chatbot_component, source_dropdown, source_data_state]
+     )
+
+     def show_popup(selected_option, source_data_state):
+         """
+         Display the full content of the selected source document in a popup.
+
+         Args:
+             selected_option: The selected option from the dropdown.
+             source_data_state: The list of source documents.
+
+         Returns:
+             An update object for the popup textbox.
+         """
+         if selected_option is None:
+             return gr.update(visible=False)
+         sources = source_data_state
+         # Extract index from selected_option
+         idx = int(selected_option.split('.')[0]) - 1
+         source = sources[idx]
+         full_content = f"Source: {source['source']}\n\n{source['full_content']}"
+         return gr.update(value=full_content, visible=True)
+
+     # Define the change action for the dropdown
+     source_dropdown.change(show_popup, inputs=[source_dropdown, source_data_state], outputs=popup)
+
+ # Launch the Gradio interface
+ demo.launch()
build_vectorstore.py ADDED
@@ -0,0 +1,32 @@
+ # build_vectorstore.py
+
+ from embeddings import init_embeddings
+ from vectorstore import create_vector_stores, create_vector_store_from_folder
+ import os
+
+ def main():
+     """
+     Main function to build vector stores from specified document paths and folders.
+     """
+     # Initialize embeddings
+     embeddings = init_embeddings()
+
+     # List of document paths to process
+     doc_paths = [
+         "docs/docs_v1.txt",
+         "docs/docs_v2.txt",
+         "docs/docs_v3.txt"
+     ]
+
+     # Create vector stores for individual documents
+     create_vector_stores(doc_paths, embeddings)
+
+     # Create vector store from the 'formatted_issues' folder
+     formatted_issues_folder = "formatted_issues"
+     if os.path.exists(formatted_issues_folder):
+         create_vector_store_from_folder(formatted_issues_folder, embeddings)
+     else:
+         print(f"Folder {formatted_issues_folder} does not exist. Skipping.")
+
+ if __name__ == "__main__":
+     main()
chain.py ADDED
@@ -0,0 +1,46 @@
+ # chain.py
+
+ from langchain.chains import ConversationalRetrievalChain
+ from langchain.memory import ConversationBufferMemory
+ from langchain.prompts import PromptTemplate
+
+ def init_conversational_chain(llm, retriever):
+     """
+     Initialize the Conversational Retrieval Chain with memory and custom prompt.
+
+     Args:
+         llm: The language model to use.
+         retriever: The retriever to fetch relevant documents.
+
+     Returns:
+         An instance of ConversationalRetrievalChain.
+     """
+     # Initialize conversation memory
+     memory = ConversationBufferMemory(
+         return_messages=True,
+         memory_key="chat_history",
+         output_key="answer"
+     )
+
+     # Define a custom prompt template
+     custom_prompt = PromptTemplate(
+         input_variables=["context", "question"],
+         template=(
+             "You are LangAssist, a knowledgeable assistant for the LangChain Python Library. "
+             "Given the following context from the documentation, provide a helpful answer to the user's question.\n\n"
+             "Context:\n{context}\n\n"
+             "Question: {question}\n\n"
+             "Answer:"
+         )
+     )
+
+     # Initialize the Conversational Retrieval Chain
+     qa_chain = ConversationalRetrievalChain.from_llm(
+         llm=llm,
+         retriever=retriever,
+         memory=memory,
+         return_source_documents=True,
+         combine_docs_chain_kwargs={"prompt": custom_prompt},
+         verbose=False
+     )
+     return qa_chain
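+ # Illustrative usage (mirrors app.py): the chain takes a question plus prior
+ # (user, assistant) turns and returns the answer along with its source documents, e.g.
+ #   result = qa_chain.invoke({"question": "What is a retriever?", "chat_history": []})
+ #   result["answer"], result["source_documents"]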
embeddings.py ADDED
@@ -0,0 +1,25 @@
+ # embeddings.py
+
+ from langchain_huggingface import HuggingFaceEmbeddings
+ import torch
+
+ def init_embeddings():
+     """
+     Initialize the HuggingFace embeddings model.
+
+     Returns:
+         An instance of HuggingFaceEmbeddings.
+     """
+     model_name = "sentence-transformers/all-mpnet-base-v2"
+     model_kwargs = {
+         'device': 'cuda' if torch.cuda.is_available() else 'cpu'
+     }
+     encode_kwargs = {'normalize_embeddings': False}
+
+     embeddings = HuggingFaceEmbeddings(
+         model_name=model_name,
+         model_kwargs=model_kwargs,
+         encode_kwargs=encode_kwargs
+     )
+
+     return embeddings
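+ # Illustrative check: all-mpnet-base-v2 produces 768-dimensional vectors, so
+ #   len(init_embeddings().embed_query("hello")) == 768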
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ faiss-cpu==1.9.0.post1
+ gradio==5.6.0
+ gradio_client==1.4.3
+ langchain==0.3.8
+ langchain-community==0.3.8
+ langchain-core==0.3.21
+ langchain-groq==0.2.1
+ langchain-huggingface==0.1.2
+ langchain-text-splitters==0.3.2
retriever.py ADDED
@@ -0,0 +1,70 @@
+ # retriever.py
+
+ from langchain.schema import BaseRetriever
+ from typing import List
+ from pydantic import BaseModel
+
+ class CombinedRetriever(BaseRetriever):
+     """
+     A retriever that combines multiple retrievers and returns the top K relevant documents.
+     """
+     retrievers: List[BaseRetriever]
+     k: int = 5
+
+     def _get_relevant_documents(self, query: str):
+         """
+         Retrieve relevant documents by querying all combined retrievers.
+
+         Args:
+             query: The search query string.
+
+         Returns:
+             A list of relevant documents.
+         """
+         all_docs = []
+         for retriever in self.retrievers:
+             # Query this retriever and collect its results
+             docs = retriever.get_relevant_documents(query)
+             all_docs.extend(docs)
+         # Return the top K documents
+         return all_docs[:self.k]
+
+     async def _aget_relevant_documents(self, query: str):
+         """
+         Asynchronously retrieve relevant documents by querying all combined retrievers.
+
+         Args:
+             query: The search query string.
+
+         Returns:
+             A list of relevant documents.
+         """
+         all_docs = []
+         for retriever in self.retrievers:
+             # Query this retriever asynchronously and collect its results
+             docs = await retriever.aget_relevant_documents(query)
+             all_docs.extend(docs)
+         # Return the top K documents
+         return all_docs[:self.k]
+
+ def create_combined_retriever(vector_stores, search_kwargs={"k": 3}):
+     """
+     Create a CombinedRetriever from multiple vector stores.
+
+     Args:
+         vector_stores: A dictionary of vector stores.
+         search_kwargs: Keyword arguments for the retrievers (e.g., number of documents).
+
+     Returns:
+         An instance of CombinedRetriever.
+     """
+     retrievers = [
+         vs.as_retriever(search_kwargs=search_kwargs)
+         for vs in vector_stores.values()
+     ]
+
+     combined_retriever = CombinedRetriever(
+         retrievers=retrievers,
+         k=search_kwargs.get("k", 3)
+     )
+     return combined_retriever
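+ # Illustrative usage (mirrors app.py):
+ #   retriever = create_combined_retriever(load_all_vector_stores(embeddings))
+ #   docs = retriever.get_relevant_documents("How do I use a text splitter?")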
vector_stores/docs_v1/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d5e5631b2446bc2de838d2dae70fa510cd9f8ebda6a28ce5306f7dd4ef29d9d
+ size 108309549
vector_stores/docs_v1/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abd384e8ea668299122189a21b6e22b7f7e4a1310582763ccddd2c78844fc5b8
+ size 33772455
vector_stores/docs_v2/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a3f684205f3d92385757a17b5e5a7ff9f6357a80b1ac2c441c912a1da5c1323
+ size 92697645
vector_stores/docs_v2/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44b5dad15385ae447d401cfb7f69fdf52ca4b352bd414e2b9560816e23b8fea1
+ size 30836096
vector_stores/docs_v3/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b4e53821ff162bd3b9beaedf6871866e620fb97931e7e90f477b7761cdd21b62
+ size 49499181
vector_stores/docs_v3/index.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2bb8ff41c4c4a97f26bb0a456a4f3e18518927d2ee4b4623a7aa595f780d8a82
+ size 15370601
vectorstore.py ADDED
@@ -0,0 +1,118 @@
+ # vectorstore.py
+
+ import os
+ from langchain_community.document_loaders import TextLoader
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_community.vectorstores import FAISS
+
+ def load_and_split_document(file_path, chunk_size=1000, chunk_overlap=150):
+     """
+     Load a document from a file and split it into chunks.
+
+     Args:
+         file_path: Path to the text file.
+         chunk_size: The maximum size of each chunk.
+         chunk_overlap: The overlap between chunks.
+
+     Returns:
+         A list of document chunks.
+     """
+     loader = TextLoader(
+         file_path,
+         encoding='utf-8',
+         autodetect_encoding=True
+     )
+
+     try:
+         documents = loader.load()
+     except RuntimeError:
+         # Fallback to a different encoding if autodetection fails
+         loader = TextLoader(
+             file_path,
+             encoding='latin-1',
+             autodetect_encoding=False
+         )
+         documents = loader.load()
+
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len
+     )
+
+     chunks = text_splitter.split_documents(documents)
+     return chunks
+
+ def create_vector_stores(doc_paths, embeddings):
+     """
+     Create vector stores from a list of document paths.
+
+     Args:
+         doc_paths: List of paths to document files.
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary of vector stores.
+     """
+     vector_stores = {}
+     os.makedirs("vector_stores", exist_ok=True)
+
+     for doc_path in doc_paths:
+         store_name = os.path.basename(doc_path).split('.')[0]
+         chunks = load_and_split_document(doc_path)
+         print(f"Processing {store_name}: {len(chunks)} chunks created")
+         vectorstore = FAISS.from_documents(chunks, embeddings)
+         vectorstore.save_local(f"vector_stores/{store_name}")
+         vector_stores[store_name] = vectorstore
+
+     return vector_stores
+
+ def create_vector_store_from_folder(folder_path, embeddings):
+     """
+     Create a single vector store from all text files in a folder.
+
+     Args:
+         folder_path: Path to the folder containing text files.
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary containing the created vector store.
+     """
+     vector_stores = {}
+     os.makedirs("vector_stores", exist_ok=True)
+     all_chunks = []
+     file_names = []
+
+     for filename in os.listdir(folder_path):
+         if filename.endswith(".txt"):
+             file_path = os.path.join(folder_path, filename)
+             chunks = load_and_split_document(file_path)
+             all_chunks.extend(chunks)
+             file_names.append(filename)
+
+     print(f"Processing {folder_path}: {len(all_chunks)} chunks created from {len(file_names)} files")
+     vectorstore = FAISS.from_documents(all_chunks, embeddings)
+     store_name = os.path.basename(folder_path.rstrip('/'))
+     vectorstore.save_local(f"vector_stores/{store_name}")
+     vector_stores[store_name] = vectorstore
+
+     return vector_stores
+
+ def load_all_vector_stores(embeddings):
+     """
+     Load all vector stores from the 'vector_stores' directory.
+
+     Args:
+         embeddings: The embeddings model to use.
+
+     Returns:
+         A dictionary of loaded vector stores.
+     """
+     vector_stores = {}
+     store_dir = "vector_stores"
+
+     for store_name in os.listdir(store_dir):
+         store_path = os.path.join(store_dir, store_name)
+         if os.path.isdir(store_path):
+             vector_stores[store_name] = FAISS.load_local(store_path, embeddings, allow_dangerous_deserialization=True)
+     return vector_stores