abdullahmubeen10 committed: Upload 5 files

Files added:
- .streamlit/config.toml (+3 -0)
- Demo.py (+157 -0)
- Dockerfile (+70 -0)
- pages/Workflow & Model Overview.py (+249 -0)
- requirements.txt (+7 -0)
.streamlit/config.toml
ADDED
@@ -0,0 +1,3 @@
[theme]
base="light"
primaryColor="#29B4E8"
Demo.py
ADDED
@@ -0,0 +1,157 @@
import streamlit as st
import sparknlp
import os
import pandas as pd

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
from annotated_text import annotated_text

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# CSS for styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .section {
            background-color: #f9f9f9;
            padding: 10px;
            border-radius: 10px;
            margin-top: 10px;
        }
        .section p, .section ul {
            color: #666666;
        }
    </style>
""", unsafe_allow_html=True)

@st.cache_resource
def init_spark():
    return sparknlp.start()

@st.cache_resource
def create_pipeline(model):
    document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

    tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')

    tokenClassifier = CamemBertForTokenClassification \
        .pretrained(model, 'en') \
        .setInputCols(['document', 'token']) \
        .setOutputCol('ner') \
        .setCaseSensitive(True) \
        .setMaxSentenceLength(512)

    # Convert NER labels to entities
    ner_converter = NerConverter() \
        .setInputCols(['document', 'token', 'ner']) \
        .setOutputCol('ner_chunk')

    pipeline = Pipeline(stages=[
        document_assembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])
    return pipeline

def fit_data(pipeline, data):
    empty_df = spark.createDataFrame([['']]).toDF('text')
    pipeline_model = pipeline.fit(empty_df)
    model = LightPipeline(pipeline_model)
    result = model.fullAnnotate(data)
    return result

def annotate(data):
    document, chunks, labels = data["Document"], data["NER Chunk"], data["NER Label"]
    annotated_words = []
    for chunk, label in zip(chunks, labels):
        parts = document.split(chunk, 1)
        if parts[0]:
            annotated_words.append(parts[0])
        annotated_words.append((chunk, label))
        document = parts[1]
    if document:
        annotated_words.append(document)
    annotated_text(*annotated_words)

# Set up the page layout
st.markdown('<div class="main-title">Recognize Entities with CamemBERT</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>This model performs Named Entity Recognition (NER) using CamemBERT, a powerful language model trained specifically for French. It can accurately identify entities such as locations, organizations, persons, and miscellaneous categories in text.</p>
</div>
""", unsafe_allow_html=True)

# Sidebar content
model = st.sidebar.selectbox(
    "Choose the pretrained model",
    ['camembert_base_token_classifier_wikiner'],
    help="For more info about the models visit: https://sparknlp.org/models"
)

# Reference notebook link in sidebar
link = """
<a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/5cd574dd8065d3d7406816bee36b1ef56b3f9359/Spark_NLP_Udemy_MOOC/Open_Source/17.01.Transformers-based_Embeddings.ipynb#L102">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
</a>
"""
st.sidebar.markdown('Reference notebook:')
st.sidebar.markdown(link, unsafe_allow_html=True)

# Load examples
# English and French text samples for testing the CamemBERT model
examples = [
    """Barack Obama was born in Hawaii and later became the 44th President of the United States. He attended Columbia University and Harvard Law School, where he served as the president of the Harvard Law Review. After graduation, he worked as a civil rights attorney and taught constitutional law at the University of Chicago Law School. Obama's presidential campaign began in 2007, and he was elected as the first African American president in 2008. During his presidency, he signed into law the Affordable Care Act, passed the Dodd-Frank Act, and ordered the military operation that resulted in the death of Osama bin Laden.""",
    """Paris is the capital of France and one of the most visited cities in the world. The Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral are among its most iconic landmarks. The city is also a global center for art, fashion, gastronomy, and culture. In addition to its historical sites, Paris is known for its cafés, parks, and gardens. The River Seine runs through the city, adding to its charm and providing picturesque views. Paris has been a major hub for education, politics, and commerce for centuries.""",
    """Apple Inc. is an American multinational technology company headquartered in Cupertino, California. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. Apple is known for its innovative products, including the iPhone, iPad, and Mac computers. The company has a significant presence in Paris, where it operates several retail stores and offices. Apple's commitment to design and user experience has made it one of the most valuable companies in the world. The company continues to lead the industry in technology and sustainability initiatives.""",
    """Barack Obama est né à Hawaï et est ensuite devenu le 44e président des États-Unis. Il a étudié à l'Université Columbia et à la Faculté de droit de Harvard, où il a été président de la Harvard Law Review. Après avoir obtenu son diplôme, il a travaillé comme avocat spécialisé en droits civiques et a enseigné le droit constitutionnel à la Faculté de droit de l'Université de Chicago. La campagne présidentielle d'Obama a commencé en 2007, et il a été élu premier président afro-américain en 2008. Pendant sa présidence, il a promulgué la loi sur les soins abordables, fait adopter la loi Dodd-Frank, et ordonné l'opération militaire qui a conduit à la mort d'Oussama ben Laden.""",
    """Paris est la capitale de la France et l'une des villes les plus visitées au monde. La Tour Eiffel, le Musée du Louvre et la Cathédrale Notre-Dame comptent parmi ses monuments les plus emblématiques. La ville est également un centre mondial de l'art, de la mode, de la gastronomie et de la culture. En plus de ses sites historiques, Paris est connue pour ses cafés, ses parcs et ses jardins. La Seine traverse la ville, ajoutant à son charme et offrant des vues pittoresques. Paris est depuis des siècles un important centre d'éducation, de politique et de commerce.""",
    """Apple Inc. est une multinationale technologique américaine dont le siège est à Cupertino, en Californie. Elle a été fondée par Steve Jobs, Steve Wozniak et Ronald Wayne en avril 1976. Apple est connue pour ses produits innovants, notamment l'iPhone, l'iPad et les ordinateurs Mac. La société a une présence importante à Paris, où elle exploite plusieurs magasins de détail et bureaux. L'engagement d'Apple en matière de design et d'expérience utilisateur en a fait l'une des entreprises les plus précieuses au monde. La société continue de diriger l'industrie en matière de technologie et d'initiatives de durabilité."""
]

selected_text = st.selectbox("Select an example", examples)
custom_input = st.text_input("Try it with your own sentence!")

text_to_analyze = custom_input if custom_input else selected_text

st.subheader('Full example text')
HTML_WRAPPER = """<div class="scroll entities" style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem; white-space:pre-wrap">{}</div>"""
st.markdown(HTML_WRAPPER.format(text_to_analyze), unsafe_allow_html=True)

# Initialize Spark and create the pipeline
spark = init_spark()
pipeline = create_pipeline(model)
output = fit_data(pipeline, text_to_analyze)

# Display the annotated output
st.subheader("Processed output:")

results = {
    'Document': output[0]['document'][0].result,
    'NER Chunk': [n.result for n in output[0]['ner_chunk']],
    'NER Label': [n.metadata['entity'] for n in output[0]['ner_chunk']]
}

annotate(results)

with st.expander("View DataFrame"):
    df = pd.DataFrame({'NER Chunk': results['NER Chunk'], 'NER Label': results['NER Label']})
    df.index += 1
    st.dataframe(df)
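For quick testing outside Streamlit, the same stages can be exercised directly with Spark NLP's LightPipeline. The following is a minimal sketch, not part of the uploaded files: it reuses the model name and pipeline stages from Demo.py and assumes spark-nlp and pyspark from requirements.txt are installed, with the pretrained model downloading on first use.

# Standalone sketch of the Demo.py pipeline (assumes spark-nlp and pyspark are installed)
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, CamemBertForTokenClassification, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
token_classifier = CamemBertForTokenClassification.pretrained(
    "camembert_base_token_classifier_wikiner", "en"
).setInputCols(["document", "token"]).setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, tokenizer, token_classifier, ner_converter])

# Fit on an empty DataFrame, then annotate plain strings with LightPipeline (as Demo.py does)
empty_df = spark.createDataFrame([[""]]).toDF("text")
light = LightPipeline(pipeline.fit(empty_df))

result = light.fullAnnotate("Paris est la capitale de la France.")[0]
for chunk in result["ner_chunk"]:
    print(chunk.result, chunk.metadata["entity"])

This mirrors the fit_data/annotate flow of the app; the printed chunks and labels correspond to what the Streamlit page renders with annotated_text.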
Dockerfile
ADDED
@@ -0,0 +1,70 @@
# Download base image ubuntu 18.04
FROM ubuntu:18.04

# Set environment variables
ENV NB_USER jovyan
ENV NB_UID 1000
ENV HOME /home/${NB_USER}

# Install required packages
RUN apt-get update && apt-get install -y \
    tar \
    wget \
    bash \
    rsync \
    gcc \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libpng-dev \
    libzmq3-dev \
    python3 \
    python3-dev \
    python3-pip \
    unzip \
    pkg-config \
    software-properties-common \
    graphviz \
    openjdk-8-jdk \
    ant \
    ca-certificates-java \
    && apt-get clean \
    && update-ca-certificates -f;

# Install Python 3.8 and pip
RUN add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y python3.8 python3-pip \
    && apt-get clean;

# Set up JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN mkdir -p ${HOME} \
    && echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> ${HOME}/.bashrc \
    && chown -R ${NB_UID}:${NB_UID} ${HOME}

# Create a new user named "jovyan" with user ID 1000
RUN useradd -m -u ${NB_UID} ${NB_USER}

# Switch to the "jovyan" user
USER ${NB_USER}

# Set home and path variables for the user
ENV HOME=/home/${NB_USER} \
    PATH=/home/${NB_USER}/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR ${HOME}

# Upgrade pip and install Python dependencies
RUN python3.8 -m pip install --upgrade pip
COPY requirements.txt /tmp/requirements.txt
RUN python3.8 -m pip install -r /tmp/requirements.txt

# Copy the application code into the container at /home/jovyan
COPY --chown=${NB_USER}:${NB_USER} . ${HOME}

# Expose port for Streamlit
EXPOSE 7860

# Define the entry point for the container
ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]
pages/Workflow & Model Overview.py
ADDED
@@ -0,0 +1,249 @@
import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
        .benchmark-table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 20px;
        }
        .benchmark-table th, .benchmark-table td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        .benchmark-table th {
            background-color: #4A90E2;
            color: white;
        }
        .benchmark-table td {
            background-color: #f2f2f2;
        }
    </style>
""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""
<div class="section">
    <p>Spark NLP offers a variety of CamemBERT-based annotators tailored for multiple natural language processing tasks. CamemBERT is a robust and versatile model designed specifically for the French language, offering state-of-the-art performance in a range of NLP applications. Below is an overview of the CamemBERT annotator used in this demo: token classification.</p>
</div>
""", unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <h2>CamemBERT for Token Classification</h2>
    <p>The <strong>CamemBertForTokenClassification</strong> annotator is designed for Named Entity Recognition (NER) tasks using CamemBERT, a French language model derived from RoBERTa. This model handles token classification, which involves labeling tokens in a text with tags that correspond to specific entities. CamemBERT offers robust performance in French NLP tasks, making it a valuable tool for real-time applications in this language.</p>
    <p>Token classification with CamemBERT enables:</p>
    <ul>
        <li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as names, organizations, locations, and other predefined categories.</li>
        <li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>
        <li><strong>Text Categorization:</strong> Enhancing document retrieval and categorization based on entity recognition.</li>
    </ul>
    <p>Here is an example of how CamemBERT token classification works:</p>
    <table class="benchmark-table">
        <tr>
            <th>Entity</th>
            <th>Label</th>
        </tr>
        <tr>
            <td>Paris</td>
            <td>LOC</td>
        </tr>
        <tr>
            <td>Emmanuel Macron</td>
            <td>PER</td>
        </tr>
        <tr>
            <td>Élysée Palace</td>
            <td>ORG</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

# CamemBERT Token Classification - French WikiNER
st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The <strong>camembert_base_token_classifier_wikiner</strong> model is a CamemBERT model fine-tuned for token classification, specifically Named Entity Recognition (NER) on the French WikiNER dataset. It recognizes four entity types (LOC, PER, MISC, and ORG), plus the O tag for tokens outside any entity.</p>
</div>
""", unsafe_allow_html=True)

# How to Use the Model - Token Classification
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

# Results
st.text("""
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Paris             |LOC      |
|France            |LOC      |
|Emmanuel Macron   |PER      |
|Élysée Palace     |ORG      |
|Apple Inc.        |ORG      |
+------------------+---------+
""")

# Performance Metrics
st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Here are the detailed performance metrics for the CamemBERT token classification model:</p>
    <table class="benchmark-table">
        <tr>
            <th>Entity</th>
            <th>Precision</th>
            <th>Recall</th>
            <th>F1-Score</th>
        </tr>
        <tr>
            <td>LOC</td>
            <td>0.93</td>
            <td>0.94</td>
            <td>0.94</td>
        </tr>
        <tr>
            <td>PER</td>
            <td>0.95</td>
            <td>0.95</td>
            <td>0.95</td>
        </tr>
        <tr>
            <td>ORG</td>
            <td>0.92</td>
            <td>0.91</td>
            <td>0.91</td>
        </tr>
        <tr>
            <td>MISC</td>
            <td>0.86</td>
            <td>0.85</td>
            <td>0.85</td>
        </tr>
        <tr>
            <td>O</td>
            <td>0.99</td>
            <td>0.99</td>
            <td>0.99</td>
        </tr>
        <tr>
            <td>Overall</td>
            <td>0.97</td>
            <td>0.98</td>
            <td>0.98</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

# Model Information - Token Classification
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>
        <li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>
        <li><strong>License:</strong> Open Source</li>
        <li><strong>Edition:</strong> Official</li>
        <li><strong>Input Labels:</strong> [token, document]</li>
        <li><strong>Output Labels:</strong> [ner]</li>
        <li><strong>Language:</strong> French</li>
        <li><strong>Size:</strong> 412.2 MB</li>
        <li><strong>Case Sensitive:</strong> Yes</li>
        <li><strong>Max Sentence Length:</strong> 512</li>
    </ul>
</div>
""", unsafe_allow_html=True)

# References - Token Classification
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">CamemBERT WikiNER Dataset</a></li>
        <li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)
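The overview page prints the exploded entities with show(); Demo.py instead displays a pandas table in its "View DataFrame" expander. The following is a minimal sketch of how the same selection can be brought into pandas; it is not part of the uploaded files and assumes `result` is the transformed DataFrame from the usage example above and that the output is small enough to collect to the driver.

# Sketch: collect the exploded entities into pandas (assumes `result` from the pipeline above)
from pyspark.sql.functions import col, expr

entities_pdf = (
    result.select(expr("explode(entities) as ner_chunk"))
          .select(
              col("ner_chunk.result").alias("chunk"),
              col("ner_chunk.metadata.entity").alias("ner_label"),
          )
          .toPandas()  # pulls all rows to the driver; fine for a handful of entities
)
print(entities_pdf)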
requirements.txt
ADDED
@@ -0,0 +1,7 @@
streamlit
st-annotated-text
streamlit-tags
pandas
numpy
spark-nlp
pyspark