Commit 04a15c5, committed by seanpedrickcase
1 Parent(s): d80c8f5
Updated packages. Improved hierarchy visualisation. Better models - mixedbread and phi3. Now option to split texts into sentences before modelling.
Files changed:
- Dockerfile +45 -1
- README.md +10 -2
- app.py +10 -4
- funcs/anonymiser.py +14 -4
- funcs/embeddings.py +1 -1
- funcs/helper_functions.py +1 -1
- funcs/prompts.py +33 -1
- funcs/representation_model.py +34 -41
- funcs/topic_core_funcs.py +47 -12
- requirements.txt +9 -8
- requirements_gpu.txt +17 -0
Dockerfile
CHANGED
@@ -1,4 +1,16 @@
-
+# First stage: build dependencies
+FROM public.ecr.aws/docker/library/python:3.11.9-slim-bookworm
+
+# Install Lambda web adapter in case you want to run with with an AWS Lamba function URL
+COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.3 /lambda-adapter /opt/extensions/lambda-adapter
+
+# Install wget and curl
+RUN apt-get update && apt-get install -y \
+    wget \
+    curl
+
+# Create a directory for the model
+RUN mkdir /model
 
 WORKDIR /src
 
@@ -6,10 +18,36 @@ COPY requirements.txt .
 
 RUN pip install --no-cache-dir -r requirements.txt
 
+# Gradio needs to be installed after due to conflict with spacy in requirements
+RUN pip install --no-cache-dir gradio==4.36.1
+
+# Download the quantised phi model directly with curl
+RUN curl -L -o Phi-3-mini-128k-instruct.Q4_K_M.gguf https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF/tree/main/Phi-3-mini-128k-instruct.Q4_K_M.gguf
+
+# If needed, move the file to your desired directory in the Docker image
+RUN mv Phi-3-mini-128k-instruct.Q4_K_M.gguf /model/rep/
+
+# Download the Mixed bread embedding model during the build process
+RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
+RUN apt-get install git-lfs -y
+RUN git lfs install
+RUN git clone https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 /model/embed
+RUN rm -rf /model/embed/.git
+
 # Set up a new user named "user" with user ID 1000
 RUN useradd -m -u 1000 user
+
+# Change ownership of /home/user directory
+RUN chown -R user:user /home/user
+
+# Make output folder
+RUN mkdir -p /home/user/app/output && chown -R user:user /home/user/app/output
+RUN mkdir -p /home/user/.cache/huggingface/hub && chown -R user:user /home/user/.cache/huggingface/hub
+RUN mkdir -p /home/user/.cache/matplotlib && chown -R user:user /home/user/.cache/matplotlib
+
 # Switch to the "user" user
 USER user
+
 # Set home to the user's home directory
 ENV HOME=/home/user \
     PATH=/home/user/.local/bin:$PATH \
@@ -18,7 +56,11 @@ ENV HOME=/home/user \
     GRADIO_ALLOW_FLAGGING=never \
     GRADIO_NUM_PORTS=1 \
     GRADIO_SERVER_NAME=0.0.0.0 \
+    GRADIO_SERVER_PORT=7860 \
     GRADIO_THEME=huggingface \
+    AWS_STS_REGIONAL_ENDPOINT=regional \
+    GRADIO_OUTPUT_FOLDER='output/' \
+    #GRADIO_ROOT_PATH=/data-text-search \
     SYSTEM=spaces
 
 # Set the working directory to the user's home directory
@@ -26,5 +68,7 @@ WORKDIR $HOME/app
 
 # Copy the current directory contents into the container at $HOME/app setting the owner to the user
 COPY --chown=user . $HOME/app
+#COPY . $HOME/app
+
 
 CMD ["python", "app.py"]

README.md
CHANGED
@@ -4,10 +4,18 @@ emoji: 🚀
 colorFrom: red
 colorTo: yellow
 sdk: gradio
-sdk_version: 4.
+sdk_version: 4.36.1
 app_file: app.py
 pinned: true
 license: apache-2.0
 ---
 
-
+# Topic modeller
+
+Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
+
+Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+
+For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
+
+I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.

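Note on reusing embeddings: the README above says a previously saved embeddings .npz file can be loaded to skip the embedding step. As a minimal, hypothetical sketch (the app writes these files with np.savez_compressed, as shown in funcs/topic_core_funcs.py below), saving and re-loading such a file looks like this:

    import numpy as np

    # Stand-in for real document embeddings (e.g. 512-dimensional mxbai vectors)
    embeddings_out = np.random.rand(100, 512).astype(np.float32)

    # Save the way the app does...
    np.savez_compressed("my_data_large_embeddings.npz", embeddings_out)

    # ...and load it back later to skip the embedding step
    loaded = np.load("my_data_large_embeddings.npz")
    embeddings_out = loaded[loaded.files[0]]
    print(embeddings_out.shape)  # (100, 512)
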
app.py
CHANGED
@@ -10,6 +10,7 @@ from funcs.topic_core_funcs import pre_clean, extract_topics, reduce_outliers, r
 from funcs.helper_functions import initial_file_load, custom_regex_load
 from sklearn.feature_extraction.text import CountVectorizer
 
+
 # Gradio app
 
 block = gr.Blocks(theme = gr.themes.Base())
@@ -32,7 +33,9 @@ with block:
 # Topic modeller
 Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
 
-Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [
+Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+
+For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
 
 I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
 """)
@@ -48,9 +51,10 @@ with block:
 clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Clean data - remove html, numbers with > 1 digits, emails, postcodes (UK), custom regex.")
 drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 char strings. May make old embedding files incompatible due to differing lengths.")
 anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective. This is slow!")
+split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split open text into sentences. Useful for small datasets.")
 with gr.Row():
 custom_regex = gr.UploadButton(label="Import custom regex file", file_count="multiple")
-gr.Markdown("""Import custom regex - csv table with one column of regex patterns with header. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
+gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
 custom_regex_text = gr.Textbox(label="Custom regex load status")
 clean_btn = gr.Button("Clean data")
 
@@ -108,7 +112,7 @@ with block:
 
 # Clean data
 custom_regex.upload(fn=custom_regex_load, inputs=[custom_regex], outputs=[custom_regex_text, custom_regex_state])
-clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state], api_name="clean")
+clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state], api_name="clean")
 
 # Extract topics
 topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, low_resource_mode_opt, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, seed_number, calc_probs, vectoriser_state], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
@@ -125,4 +129,6 @@ with block:
 # Visualise topics
 plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, low_resource_mode_opt, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
 
-
+# Launch the Gradio app
+if __name__ == "__main__":
+    block.queue().launch(show_error=True)#, server_name="0.0.0.0", ssl_verify=False, server_port=7860)

funcs/anonymiser.py
CHANGED
@@ -8,19 +8,21 @@ def spacy_model_installed(model_name):
         import en_core_web_sm
         en_core_web_sm.load()
         print("Successfully imported spaCy model")
-
+        nlp = spacy.load("en_core_web_sm")
         #print(nlp._path)
     except:
         download(model_name)
-        spacy.load(model_name)
+        nlp = spacy.load(model_name)
         print("Successfully imported spaCy model")
         #print(nlp._path)
 
+    return nlp
+
 
 #if not is_model_installed(model_name):
 # os.system(f"python -m spacy download {model_name}")
 model_name = "en_core_web_sm"
-spacy_model_installed(model_name)
+nlp = spacy_model_installed(model_name)
 
 #spacy.load(model_name)
 # Need to overwrite version of gradio present in Huggingface spaces as it doesn't have like buttons/avatars (Oct 2023)
@@ -41,7 +43,15 @@ from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine, PatternRecogn
 from presidio_anonymizer import AnonymizerEngine, BatchAnonymizerEngine
 from presidio_anonymizer.entities import OperatorConfig
 
-
+# Function to Split Text and Create DataFrame using SpaCy
+def expand_sentences_spacy(df, colname, nlp=nlp):
+    expanded_data = []
+    df = df.reset_index(names='index')
+    for index, row in df.iterrows():
+        doc = nlp(row[colname])
+        for sent in doc.sents:
+            expanded_data.append({'document_index': row['index'], colname: sent.text})
+    return pd.DataFrame(expanded_data)
 
 def anon_consistent_names(df):
     # ## Pick out common names and replace them with the same person value

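A short usage sketch for the new expand_sentences_spacy helper (hypothetical data; assumes the en_core_web_sm model is installed so the module-level nlp loads):

    import pandas as pd
    from funcs.anonymiser import expand_sentences_spacy

    df = pd.DataFrame({"text": ["First sentence. Second sentence.", "Only one here."]})
    sentences = expand_sentences_spacy(df, "text")
    print(sentences)
    # Expected: three rows, one sentence each, with 'document_index' 0, 0, 1
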
funcs/embeddings.py
CHANGED
@@ -47,7 +47,7 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em
     print("Creating dense embeddings based on transformers model")
 
     #embeddings_out = embedding_model.encode(sentences=docs, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina # #
-    embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32) # For
+    embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32, precision="int8") # For large
 
     toc = time.perf_counter()
     time_out = f"The embedding took {toc - tic:0.1f} seconds"

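The one-line change above switches encoding to int8-quantised vectors. A hedged sketch of what that call returns (assumes a recent sentence-transformers release that supports the precision and truncate_dim arguments, and that the Mixedbread model is available locally or downloadable):

    from sentence_transformers import SentenceTransformer

    embedding_model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=512)
    embeddings_out = embedding_model.encode(sentences=["a short test sentence"], batch_size=32, precision="int8")
    print(embeddings_out.dtype, embeddings_out.shape)  # expected: int8 (1, 512)
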
funcs/helper_functions.py
CHANGED
@@ -144,7 +144,7 @@ def custom_regex_load(in_file):
     regex_file_names = [string for string in file_list if "csv" in string.lower()]
     if regex_file_names:
         regex_file_name = regex_file_names[0]
-        custom_regex =
+        custom_regex = pd.read_csv(regex_file_name, low_memory=False, header=None)
         #regex_file_name_no_ext = get_file_path_end(regex_file_name)
 
         output_text = "Data file loaded."

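For reference, the custom regex file this loader now expects is a single-column CSV with no header (hence header=None); a hypothetical example read the same way:

    import pandas as pd
    from io import StringIO

    example_file = StringIO("(?i)roosevelt\n(?i)churchill\n")  # one regex pattern per line, no header
    custom_regex = pd.read_csv(example_file, header=None)
    print(custom_regex[0].tolist())  # ['(?i)roosevelt', '(?i)churchill']
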
funcs/prompts.py
CHANGED
@@ -103,4 +103,36 @@ Topic label:"""
 
 stablelm_prompt = stablelm_example_prompt + stablelm_main_prompt
 
-#print("StableLM prompt: ", stablelm_prompt)
+#print("StableLM prompt: ", stablelm_prompt)
+
+
+phi3_start = "<|user|>"
+phi3_example_prompt = """<|user|>
+I have a topic that contains the following documents:
+- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
+- Meat, but especially beef, is the word food in terms of emissions.
+- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
+
+The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.
+
+Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
+
+Topic label: Environmental impacts of eating meat
+"""
+
+# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
+phi3_main_prompt = """
+Now, create a new topic label given the following information.
+
+I have a topic that contains the following documents:
+[DOCUMENTS]
+
+The topic is described by the following keywords: '[KEYWORDS]'.
+
+Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.<|end|>
+<|assistant|>
+Topic label:"""
+
+phi3_prompt = phi3_example_prompt + phi3_main_prompt
+
+#print("phi3 prompt: ", phi3_prompt)

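The new phi3 prompt uses the [DOCUMENTS] and [KEYWORDS] tags that BERTopic's text-generation representations fill in per topic. A sketch of wiring it up (assuming the LlamaCPP wrapper used in funcs/representation_model.py behaves like bertopic.representation.LlamaCPP, and that the GGUF file is already present):

    from llama_cpp import Llama
    from bertopic.representation import LlamaCPP
    from funcs.prompts import phi3_prompt

    # Hypothetical local path, matching where the Dockerfile places the model
    llm = Llama(model_path="/model/rep/Phi-3-mini-128k-instruct.Q4_K_M.gguf", n_ctx=4096)
    representation_model = LlamaCPP(llm, prompt=phi3_prompt)
    # BERTopic replaces [DOCUMENTS] and [KEYWORDS] in the prompt with each topic's
    # representative documents and keywords before asking the model for a label.
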
funcs/representation_model.py
CHANGED
@@ -6,12 +6,12 @@ import torch.cuda
 from huggingface_hub import hf_hub_download, snapshot_download
 
 from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
-from funcs.prompts import capybara_prompt, capybara_start, open_hermes_prompt, open_hermes_start, stablelm_prompt, stablelm_start
+from funcs.prompts import capybara_prompt, capybara_start, open_hermes_prompt, open_hermes_start, stablelm_prompt, stablelm_start, phi3_prompt, phi3_start
 
 random_seed = 42
 
-chosen_prompt = open_hermes_prompt # stablelm_prompt
-chosen_start_tag = open_hermes_start # stablelm_start
+chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
+chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
 
 
 # Currently set n_gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
@@ -91,13 +91,14 @@ mmr = MaximalMarginalRelevance(diversity=0.5)
 base_rep = BaseRepresentation()
 
 # Find model file
-def find_model_file(hf_model_name, hf_model_file, search_folder):
+def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
     hf_loc = search_folder #os.environ["HF_HOME"]
-    hf_sub_loc = search_folder +
+    hf_sub_loc = search_folder + sub_folder #os.environ["HF_HOME"]
 
-
-
-
+    if sub_folder == "/hub/":
+        hf_model_name_path = hf_sub_loc + 'models--' + hf_model_name.replace("/","--")
+    else:
+        hf_model_name_path = hf_sub_loc
 
     def find_file(root_folder, file_name):
         for root, dirs, files in os.walk(root_folder):
@@ -109,36 +110,11 @@ def find_model_file(hf_model_name, hf_model_file, search_folder):
     folder_path = hf_model_name_path # Replace with your folder path
     file_to_find = hf_model_file # Replace with the file name you're looking for
 
-
-    if found_file:
-        print(f"Model file found: {found_file}")
-        return found_file
-    else:
-        error = "File not found."
-        print(error, " Downloading model from hub")
-
-        # Specify your custom directory
-        # Get HF_HOME environment variable or default to "~/.cache/huggingface/hub"
-        #hf_home_value = search_folder
-
-        # Check if the directory exists, create it if it doesn't
-        #if not os.path.exists(hf_home_value):
-        #    os.makedirs(hf_home_value)
-
-
-
-        found_file = hf_hub_download(repo_id=hf_model_name, filename=hf_model_file)#, local_dir=hf_home_value) # cache_dir
+    print("Searching for model file", hf_model_file, "in:", hf_model_name_path)
 
-
-
-
-        # local_files_only=False
-        #)
-
-        print("Downloaded model to: ", found_file)
-
-        #found_file = find_file(path, file_to_find)
-        return found_file
+    found_file = find_file(folder_path, file_to_find) # os.environ["HF_HOME"]
+
+    return found_file
 
 
 def create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode):
@@ -151,7 +127,7 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
 
     # Check for HF_HOME environment variable and supply a default value if it's not found (typical location for huggingface models)
     # Get HF_HOME environment variable or default to "~/.cache/huggingface/hub"
-    base_folder = "
+    base_folder = "model" #"~/.cache/huggingface/hub"
    hf_home_value = os.getenv("HF_HOME", base_folder)
 
    # Expand the user symbol '~' to the full home directory path
@@ -162,12 +138,29 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
    if not os.path.exists(hf_home_value):
        os.makedirs(hf_home_value)
 
-    print(hf_home_value)
+    print("Searching base folder for model:", hf_home_value)
+
+    found_file = find_model_file(hf_model_name, hf_model_file, hf_home_value, "/rep/")
+
+    if found_file:
+        print(f"Model file found in model folder: {found_file}")
+
+    else:
+        found_file = find_model_file(hf_model_name, hf_model_file, hf_home_value, "/hub/")
+
+    if not found_file:
+        error = "File not found in HF hub directory or in local model file."
+        print(error, " Downloading model from hub")
+
+        found_file = hf_hub_download(repo_id=hf_model_name, filename=hf_model_file)#, local_dir=hf_home_value) # cache_dir
+
+        print("Downloaded model from Huggingface Hub to: ", found_file)
 
+    print("Loading representation model with", llm_config.n_gpu_layers, "layers allocated to GPU.")
 
-    llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx,
+    llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx,seed=seed) #**llm_config.model_dump())# rope_freq_scale=0.5,
     #print(llm.n_gpu_layers)
+    print("Chosen prompt:", chosen_prompt)
     llm_model = LlamaCPP(llm, prompt=chosen_prompt)#, **gen_config.model_dump())
 
     # All representation models

funcs/topic_core_funcs.py
CHANGED
@@ -9,6 +9,7 @@ import time
 from bertopic import BERTopic
 
 from funcs.clean_funcs import initial_clean
+from funcs.anonymiser import expand_sentences_spacy
 from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs
 from funcs.embeddings import make_or_load_embeddings
 from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
@@ -47,13 +48,13 @@ today = datetime.now().strftime("%d%m%Y")
 today_rev = datetime.now().strftime("%Y%m%d")
 
 # Load embeddings
-embeddings_name = "BAAI/
+embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.5" #"jinaai/jina-embeddings-v2-base-en"
 
 # LLM model used for representing topics
-hf_model_name = 'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
-hf_model_file = 'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
+hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
+hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
 
-def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text, drop_duplicate_text, anonymise_drop, progress=gr.Progress(track_tqdm=True)):
+def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text, drop_duplicate_text, anonymise_drop, sentence_split_drop, progress=gr.Progress(track_tqdm=True)):
 
     output_text = ""
     output_list = []
@@ -116,6 +117,19 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
     anon_toc = time.perf_counter()
     time_out = f"Anonymising text took {anon_toc - anon_tic:0.1f} seconds"
 
+    if sentence_split_drop == "Yes":
+        progress(0.6, desc= "Splitting text into sentences")
+
+        data_file_name_no_ext = data_file_name_no_ext + "_split"
+
+        anon_tic = time.perf_counter()
+
+        data = expand_sentences_spacy(data, in_colnames_list_first)
+        data = data[data[in_colnames_list_first].str.len() >= 5] # Keep only rows with at least 5 characters
+
+        anon_toc = time.perf_counter()
+        time_out = f"Anonymising text took {anon_toc - anon_tic:0.1f} seconds"
+
     out_data_name = data_file_name_no_ext + "_" + today_rev + ".csv"
     data.to_csv(out_data_name)
     output_list.append(out_data_name)
@@ -159,15 +173,36 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
     print("Low resource mode: ", low_resource_mode)
 
     if low_resource_mode == "No":
-        print("Using high resource
+        print("Using high resource embedding model")
+
+        # Define a list of possible local locations to search for the model
+        local_embeddings_locations = [
+            "model/embed/", # Potential local location
+            "/model/embed/", # Potential location in Docker container
+            "/home/user/app/model/embed/" # This is inside a Docker container
+        ]
+
+        # Attempt to load the model from each local location
+        for location in local_embeddings_locations:
+            try:
+                embedding_model = SentenceTransformer(location, truncate_dim=512)
+                print(f"Found local model installation at: {location}")
+                break # Exit the loop if the model is found
+            except Exception as e:
+                print(f"Failed to load model from {location}: {e}")
+                continue
+        else:
+            # If the loop completes without finding the model in any local location
+            embedding_model = SentenceTransformer(embeddings_name, truncate_dim=512)
+            print("Could not find local model installation. Downloading from Huggingface")
 
-        embedding_model = SentenceTransformer(embeddings_name)
+        #embedding_model = SentenceTransformer(embeddings_name, truncate_dim=512)
 
         # If tfidf embeddings currently exist, wipe these empty
         if embeddings_type_state == "tfidf":
            embeddings_out = np.array([])
 
-        embeddings_type_state = "
+        embeddings_type_state = "large"
 
     # UMAP model uses Bertopic defaults
     umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=random_seed)
@@ -180,8 +215,8 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
         TruncatedSVD(100, random_state=random_seed)
         )
 
-        # If
-        if embeddings_type_state == "
+        # If large embeddings currently exist, wipe these empty, then rename embeddings type
+        if embeddings_type_state == "large":
            embeddings_out = np.array([])
 
         embeddings_type_state = "tfidf"
@@ -316,9 +351,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
         embeddings_file_name = data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
     else:
         if embeddings_super_compress == "No":
-            embeddings_file_name = data_file_name_no_ext + '_' + '
+            embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings.npz'
         else:
-            embeddings_file_name = data_file_name_no_ext + '_' + '
+            embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
 
     np.savez_compressed(embeddings_file_name, embeddings_out)
 
@@ -516,7 +551,7 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
 
 
    #try:
-    topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
+    topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
     topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
 
     # Write hierarchical topics levels to df

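The local-model search added above relies on Python's for/else construct: the else branch runs only when the loop finishes without a break, i.e. no local copy could be loaded. A minimal, self-contained illustration of the same pattern:

    import os

    local_embeddings_locations = [
        "model/embed/",
        "/model/embed/",
        "/home/user/app/model/embed/",
    ]

    for location in local_embeddings_locations:
        if os.path.isdir(location):  # stand-in for trying SentenceTransformer(location)
            print(f"Found local model installation at: {location}")
            break
    else:
        print("Could not find local model installation. Downloading from Huggingface")
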
requirements.txt
CHANGED
@@ -1,15 +1,16 @@
-gradio
-transformers==4.
+gradio
+transformers==4.41.2
 accelerate==0.26.1
-torch==2.1
-llama-cpp-python==0.2.
-bertopic==0.16.
-spacy==3.7.
+torch==2.3.1
+llama-cpp-python==0.2.79
+bertopic==0.16.2
+spacy==3.7.4
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 pyarrow==14.0.2
 openpyxl==3.1.2
 Faker==22.2.0
-presidio_analyzer==2.2.
-presidio_anonymizer==2.2.
+presidio_analyzer==2.2.354
+presidio_anonymizer==2.2.354
 scipy==1.11.4
 polars==0.20.6
+numpy==1.26.4

requirements_gpu.txt
ADDED
@@ -0,0 +1,17 @@
+gradio
+transformers==4.41.2
+accelerate==0.26.1
+torch==2.3.1
+bertopic==0.16.2
+spacy==3.7.4
+en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
+pyarrow==14.0.2
+openpyxl==3.1.2
+Faker==22.2.0
+presidio_analyzer==2.2.354
+presidio_anonymizer==2.2.354
+scipy==1.11.4
+polars==0.20.6
+torch --index-url https://download.pytorch.org/whl/cu121
+llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
+numpy==1.26.4