import streamlit as st


# Function to display the Home Page
def show_home_page():
    st.title("Natural Language Processing (NLP)")
    st.markdown(
        """
### Welcome to the NLP Guide

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
the interaction between computers and humans through natural language. The objective is
to program computers to process and analyze large amounts of natural language data.
"""
    )


# Function to display specific topic pages
def show_page(page):
    if page == "Text preprocessing":
        st.title("Text preprocessing")
        st.markdown(
            """
### Text preprocessing

Text preprocessing converts raw text into a format that computer models can understand
and process. It cleans and normalizes the data while preserving the meaning and context
of the original language. Preprocessing is done in multiple steps, and the number of
steps can vary depending on the nature of the text and the goals of the NLP task.

- **Tokenization**: Breaks text down into smaller units called tokens. These tokens can be words, characters, or punctuation marks. For example, the sentence "I want to learn NLP." would be tokenized into: ["I", "want", "to", "learn", "NLP", "."].
- **Stop Words**: Common words that carry little meaning on their own, such as "is", "the", and "and". Removing them makes it easier to focus on the meaningful words.
- **Stemming**: Strips away suffixes and reduces words to their base form. For example, "going" is reduced to "go".
- **Lemmatization**: Reduces words to their dictionary form (lemma), which is always a valid word. It is slower than stemming because it relies on vocabulary and morphological analysis.
- **Corpus**: A large collection of text used for NLP training and analysis.
- **Vocabulary**: The set of all unique words in a corpus.
- **n-grams**: Contiguous sequences of n words or characters from a text.
- **POS Tagging**: Assigning parts of speech to words.
- **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
- **Parsing**: Analyzing the grammatical structure of text.
"""
        )
    elif page == "Vectorization":
        st.title("Vectorization")
        st.markdown(
            """
### Vectorization

Vectorization in NLP is the process of converting text into numbers so that a computer
can understand and analyze it. Since machines cannot read words like humans, we need to
transform text into a format that they can process: numerical vectors.

**One-Hot Vectorization**: One-hot vectorization is a way to represent words as numbers
so that computers can understand them. It works by creating a unique binary vector for
each word, where exactly one position is 1 and all other positions are 0.

#### Example:

Vocabulary: ["apple", "banana", "orange"]

- "apple" -> [1, 0, 0]
- "banana" -> [0, 1, 0]
- "orange" -> [0, 0, 1]
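
#### Code Sketch:

A minimal sketch of building these vectors in plain Python. The `vocabulary` and the
`one_hot` helper below are illustrative only, not part of any particular library.

```python
# Build an index for each word in the (illustrative) vocabulary.
vocabulary = ["apple", "banana", "orange"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # Return a binary vector with a 1 at the word's position and 0 elsewhere.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("banana"))  # [0, 1, 0]
```
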

#### Advantages:

- Simple and easy to implement
- Works well for small vocabularies

#### Limitations:

- High Dimensionality (Memory Usage)
  - If the vocabulary is large (e.g., 100,000 words), each word gets a 100,000-dimensional vector.
  - This leads to high memory usage and computational inefficiency.
- No Semantic Meaning (Context Ignored)
  - One-hot vectors do not capture relationships between words.
  - Example: "apple" and "fruit" should be similar, but their vectors are completely different.
- Sparse Representation
  - Most of the values in one-hot vectors are 0s, making them sparse.
  - Sparse matrices are inefficient to store and process.
- Fixed Vocabulary Size
  - The vocabulary must be predefined.
  - If a new word appears, the entire vectorization process must be redone.

#### Applications of One-Hot Vectorization:

- Text Classification
  - Used in spam detection, sentiment analysis, and document categorization.
  - Converts words into numerical form before applying machine learning models.
- Keyword Matching
  - Helps in simple search and information retrieval by matching one-hot encoded words.
"""
        )
    elif page == "Bag of Words":
        st.title("Bag of Words (BoW)")
        st.markdown(
            """
### Bag of Words (BoW)

The Bag of Words (BoW) model is a simple way to represent text as numerical features. It
ignores word order and focuses only on the frequency of words in a document.

#### How It Works:

1. Create a vocabulary of all unique words in the text.
2. Count the frequency of each word in a document.

#### Example:

Given two sentences:

- "I love NLP and Machine Learning."
- "Machine Learning is fun and exciting."

Vocabulary (after removing the stop words "and" and "is"): ["I", "love", "NLP", "Machine", "Learning", "fun", "exciting"]

- Sentence 1: [1, 1, 1, 1, 1, 0, 0]
- Sentence 2: [0, 0, 0, 1, 1, 1, 1]
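
#### Code Sketch:

A minimal sketch of the same counting step using scikit-learn's CountVectorizer
(assuming scikit-learn 1.x). Note that its default tokenizer lowercases text, drops
single-character tokens such as "I", and does not remove stop words, so the resulting
vocabulary differs slightly from the hand-built one above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP and Machine Learning.",
    "Machine Learning is fun and exciting.",
]

# CountVectorizer builds the vocabulary and the count matrix in one step.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow_matrix.toarray())                # one count vector per sentence
```
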

#### Advantages:

- Simple and Easy to Implement – Works well for basic text processing tasks.
- Effective for Small Datasets – Good for applications with a limited vocabulary.
- Works with Traditional Machine Learning Models – Can be used with models like Naïve Bayes and SVM.

#### Limitations:

- Ignores Word Order – "I love NLP" and "NLP love I" have the same representation.
- High Dimensionality – Large vocabularies lead to big, sparse feature matrices.
- Does Not Capture Meaning – Words with similar meanings ("happy" vs. "joyful") are treated separately.

#### Applications:

- Text Classification – Spam detection, sentiment analysis.
- Information Retrieval – Search engines rank documents based on word frequency.
- Topic Modeling – Identifying common themes in documents.
- Document Similarity – Comparing text based on shared words.
"""
        )
    elif page == "TF-IDF Vectorizer":
        st.title("TF-IDF Vectorizer")
        st.markdown(
            r"""
### TF-IDF Vectorizer

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical representation of text
that evaluates the importance of a word in a document relative to a collection of
documents (corpus). It adjusts the word frequency based on how common or rare a word is
across the corpus.

#### Formula:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

- **Term Frequency (TF)**: Number of times a term appears in a document divided by the total number of terms in the document.
- **Inverse Document Frequency (IDF)**: Logarithm of the total number of documents divided by the number of documents containing the term.

#### Advantages:

- Handles Common and Rare Words – Weights up words that are frequent in one document but rare across the corpus.
- Improves Relevance – Prioritizes important words by adjusting for their frequency in the whole corpus.
- Works Well with Search Engines – Helps rank documents based on relevant keywords.

#### Example:

For the corpus:

- Doc1: "NLP is amazing."
- Doc2: "NLP is fun and amazing."

TF-IDF gives a higher weight to a distinguishing word like "fun", which appears in only
one document, than to words like "is", "NLP", and "amazing" that appear in every document.
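
#### Code Sketch:

A minimal sketch of computing TF-IDF weights for this corpus with scikit-learn's
TfidfVectorizer (assuming scikit-learn 1.x; the exact values depend on its default
smoothing and normalization settings).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "NLP is amazing.",
    "NLP is fun and amazing.",
]

# Fit the vectorizer on the corpus and transform it into TF-IDF weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Words that appear in both documents get a lower IDF than words that
# appear in only one document (such as "fun" and "and").
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```
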

#### Limitations:

- Ignores Word Context – Does not consider a word's meaning or context within the document (similar to Bag of Words).
- Requires Preprocessing – Needs cleaning and stop word removal for best results.
- Limited to the Bag of Words Model – Does not capture word order or relationships (unlike Word2Vec or BERT).

#### Applications:

- Text Classification – Categorizing documents into topics based on their content.
- Information Retrieval – Ranking search engine results based on the relevance of words in the query and documents.
- Document Clustering – Grouping similar documents based on shared important terms.
- Keyword Extraction – Identifying the most important keywords in a document.
"""
        )
    elif page == "Word2Vec":
        st.title("Word2Vec")
        st.markdown(
            """
### Word2Vec

Word2Vec is a word embedding technique used to represent words in a continuous vector
space, where semantically similar words are represented by vectors that are close
together in that space. Word2Vec captures the relationships between words based on their
context in a large corpus of text.

#### How Word2Vec Works:

- **Continuous Bag of Words (CBOW)**
  - Predicts the target word based on the context words.
  - Example: Given the context words "I", "love", and "coding", the model predicts the target word "NLP".
- **Skip-gram**
  - Predicts the context words given a target word.
  - Example: Given the target word "NLP", the model predicts the context words "I", "love", and "coding".

#### Example of Word2Vec:

- Sentence 1: "I love programming"
- Sentence 2: "Programming is fun"
- Step 1: Create Context-Target Pairs (Skip-gram)
  - For Sentence 1:
    - Target Word: "love"
    - Context Words: ["I", "programming"]
  - For Sentence 2:
    - Target Word: "Programming"
    - Context Words: ["is", "fun"]
- Step 2: Train the Word2Vec Model
  - The model learns the embeddings such that words that frequently appear in similar contexts (like "love" and "programming") end up with similar vector representations.
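
#### Code Sketch:

A minimal sketch of training a tiny Skip-gram model on this toy corpus with gensim
(assuming gensim 4.x). This is only a sketch: meaningful embeddings require a much
larger corpus than two sentences.

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens.
sentences = [
    ["i", "love", "programming"],
    ["programming", "is", "fun"],
]

# sg=1 selects Skip-gram; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Look up the learned vector for a word and its nearest neighbours.
vector = model.wv["programming"]
print(vector.shape)                                  # (50,)
print(model.wv.most_similar("programming", topn=2))  # closest words in the toy corpus
```
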

#### Advantages:

- Captures Semantic Meaning – Words with similar meanings or contexts are closer in the vector space.
- Reduces Dimensionality – Converts high-dimensional one-hot vectors into lower-dimensional, dense vectors.
- Generalizable – Can be applied across multiple languages and domains.

#### Applications:

- Semantic Search – Improving search engines by ranking results based on word similarity.
- Text Classification – Representing text as vectors for machine learning models.
- Word Analogy – Solving word analogy problems like "king - man + woman = queen".
- Recommendation Systems – Suggesting content by understanding relationships between items (e.g., movies, products).

#### Limitations:

- Contextual Ambiguity – Doesn't handle words with multiple meanings (e.g., "bank" as a financial institution vs. the "bank" of a river).
- Requires a Large Corpus – Needs a large text corpus to learn meaningful embeddings.
- Fixed Representations – Each word has a single vector, which cannot account for different meanings in different contexts.
"""
        )
    elif page == "FastText":
        st.title("FastText")
        st.markdown(
            """
### FastText

FastText is an extension of the Word2Vec model, developed by Facebook's AI Research lab
(FAIR). While Word2Vec represents each word as a single vector, FastText represents each
word as a bag of character n-grams. This enables FastText to generate better word
representations, especially for rare or out-of-vocabulary words.

#### Advantages:

- Better for Rare Words – FastText can generate meaningful embeddings for rare and out-of-vocabulary words because it uses subword information.
- Handles Morphological Variations – It captures word variants better (e.g., "run", "running", and "runner" will be similar).
- Handles Subword Relationships – Since it looks at character n-grams, FastText captures similarities between words based on their internal structure.

#### Example:

The word "apple" might be represented by n-grams like "app", "ppl", and "ple".

#### Applications:

- Handling Out-of-Vocabulary Words – Useful for applications like machine translation or speech recognition where new words might appear.
- Text Classification – Efficient at representing text for downstream tasks like sentiment analysis, spam detection, etc.
- Named Entity Recognition (NER) – FastText can better identify and classify entities, even if they are rare or domain-specific.
- Language Modeling – Helps in building more robust language models for text generation or speech-to-text applications.

#### Limitations:

- Larger Model Size – Since it stores vectors for n-grams in addition to words, the model can be larger than Word2Vec for the same vocabulary.
- Slower Training – Training on large datasets can be slower because of the additional computation required for subword processing.
- No Contextualized Representation – Like Word2Vec, FastText does not provide context-sensitive word embeddings (a word with different meanings in different contexts has a single representation).
"""
        )
    elif page == "Tokenization":
        st.title("Tokenization")
        st.markdown(
            """
### Tokenization

Tokenization is the process of breaking down a text (like a sentence or document) into
smaller, meaningful units called tokens.

#### Types of Tokenization:

- **Word Tokenization**: Splits text into words.
- **Sentence Tokenization**: Splits text into sentences.

#### Libraries for Tokenization:

- NLTK, spaCy, and Hugging Face Transformers.

#### Example:

Sentence: "Tokenization is fun!"

- Word Tokens: ["Tokenization", "is", "fun", "!"]

#### Advantages:

- Essential for Text Processing – Converts raw text into manageable pieces for further analysis.
- Enables NLP Models – Allows models to understand and work with text data, whether for classification, translation, or generation.
- Flexible for Various Tasks – Tokenization can be adapted to word-level, subword-level, or character-level features.

#### Applications:

- Text Preprocessing – Tokenization is typically performed before other NLP tasks such as text classification, named entity recognition (NER), and sentiment analysis.
- Machine Translation – In translation systems, tokenization helps break sentences down into manageable parts.
- Speech Recognition – Once spoken language has been transcribed into text, tokenization breaks the transcript into individual words for further processing.
- Text Summarization – Tokenization helps break a long document into smaller units for summarization.

#### Limitations:

- Ambiguity with Punctuation – Tokenizing contractions and punctuation (e.g., "I'm" vs. "I am") can be tricky.
- Handling Compound Words – Some compound words may not be split in a way that is helpful for certain tasks.
- Language-Specific Issues – Tokenization rules vary across languages. For example, Chinese has no spaces between words, making tokenization more complex.
"""
        )
    elif page == "Stop Words":
        st.title("Stop Words")
        st.markdown(
            """
### Stop Words

Stop words are common words (such as "and", "the", "is", "in", "of") that are typically
removed from text during preprocessing in natural language processing (NLP) tasks. These
words often do not carry significant meaning and may add noise when analyzing text.

#### Examples of Stop Words:

- Articles: "a", "an", "the"
- Prepositions: "in", "on", "at", "by", "with"
- Pronouns: "he", "she", "it", "they"
- Conjunctions: "and", "but", "or", "yet"
- Auxiliary Verbs: "is", "are", "was", "were"

#### Why Remove Stop Words?

- No Meaningful Contribution: Words like "a", "an", "the", and "is" don't provide substantial information and can clutter text data.
- Reduce Dimensionality: Removing stop words reduces the size of the vocabulary and makes analysis more efficient.
- Improve Model Performance: By removing words that contribute little meaning, models can focus on the more informative words.

#### Advantages:

- Reduces Noise: Keeps unnecessary words from affecting the analysis.
- Speeds Up Processing: Decreases the number of words to process, improving efficiency.
- Improves Accuracy: Helps algorithms focus on more meaningful words.

#### Applications:

- Text Preprocessing – Stop words are often removed in the early stages of text analysis to clean the data.
- Information Retrieval – Helps improve search results by focusing on more meaningful keywords.
- Text Classification – When building models for classification tasks (e.g., spam detection), removing stop words can improve the model's ability to classify based on relevant terms.
- Sentiment Analysis – Stop word removal can enhance sentiment detection by focusing on impactful words.

#### Challenges:

- Context Loss: In some cases, stop words carry important context, and removing them may change the meaning of the sentence. Example: in "He is going to the store", removing "is" or "to" could lead to confusion.
- Language-Specific: What counts as a stop word varies by language. A word like "is" is common in English, but its equivalent may not be a stop word in another language.
"""
        )


# Sidebar navigation
st.sidebar.title("NLP Topics")
menu_options = [
    "Home",
    "Text preprocessing",
    "Vectorization",
    "Bag of Words",
    "TF-IDF Vectorizer",
    "Word2Vec",
    "FastText",
    "Tokenization",
    "Stop Words",
]
selected_page = st.sidebar.radio("Select a topic", menu_options)

# Display the selected page
if selected_page == "Home":
    show_home_page()
else:
    show_page(selected_page)
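
# A minimal usage note (the filename below is an assumption; use whatever name this
# script is saved under):
#
#     streamlit run nlp_app.py
#
# Streamlit serves the app locally and opens it in the browser, where the sidebar
# radio buttons switch between the topic pages defined above.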