shwetashweta05 committed on
Commit 946d14e · verified · 1 Parent(s): 62db24a

Update pages/NLP.py

Files changed (1)
  1. pages/NLP.py +148 -73
pages/NLP.py CHANGED
@@ -6,25 +6,21 @@ def show_home_page():
 st.markdown(
 """
 ### Welcome to NLP Guide
- Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between
- computers and humans through natural language. It enables machines to read, understand, and respond to human
- language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
- translation tools, sentiment analysis, and search engines.
- Use the menu in the sidebar to explore each topic in detail.
 """
 )

 # Function to display specific topic pages
 def show_page(page):
- if page == "NLP Terminologies":
- st.title("NLP Terminologies")
 st.markdown(
 """
- ### NLP Terminologies (Detailed Explanation)
- - **Tokenization**: Breaking text into smaller units like words or sentences.
- - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
- - **Stemming**: Reducing words to their root forms (e.g., "running" -> "run").
- - **Lemmatization**: Converting words to their dictionary base forms (e.g., "running" -> "run").
 - **Corpus**: A large collection of text used for NLP training and analysis.
 - **Vocabulary**: The set of all unique words in a corpus.
 - **n-grams**: Continuous sequences of n words/characters from text.
@@ -33,53 +29,72 @@ def show_page(page):
 - **Parsing**: Analyzing grammatical structure of text.
 """
 )
- elif page == "One-Hot Vectorization":
- st.title("One-Hot Vectorization")
 st.markdown(
 """
- ### One-Hot Vectorization
- A simple representation where each word in the vocabulary is represented as a binary vector.
- #### How It Works:
- - Each unique word in the corpus is assigned an index.
- - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 #### Example:
- Vocabulary: ["cat", "dog", "bird"]
- - "cat" -> [1, 0, 0]
- - "dog" -> [0, 1, 0]
- - "bird" -> [0, 0, 1]
 #### Advantages:
- - Simple to implement.
 #### Limitations:
- - High dimensionality for large vocabularies.
- - Does not capture semantic relationships (e.g., "cat" and "kitten" are unrelated).
- #### Applications:
- - Useful for small datasets and when computational simplicity is prioritized.
 """
 )
 elif page == "Bag of Words":
- st.title("Bag of Words (BoW)")
 st.markdown(
 """
 ### Bag of Words (BoW)
- Bag of Words is a method of representing text data as word frequency counts without considering word order.
 #### How It Works:
 1. Create a vocabulary of all unique words in the text.
 2. Count the frequency of each word in a document.
 #### Example:
 Given two sentences:
- - "I love NLP."
- - "I love programming."
- Vocabulary: ["I", "love", "NLP", "programming"]
- - Sentence 1: [1, 1, 1, 0]
- - Sentence 2: [1, 1, 0, 1]
 #### Advantages:
- - Simple to implement.
 #### Limitations:
- - High dimensionality for large vocabularies.
- - Does not consider word order or semantic meaning.
- - Sensitive to noise and frequent terms.
 #### Applications:
- - Text classification and clustering.
 """
 )
 elif page == "TF-IDF Vectorizer":
@@ -87,21 +102,30 @@ def show_page(page):
 st.markdown(
 """
 ### TF-IDF Vectorizer
- Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).
 #### Formula:
 \[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
 - **Term Frequency (TF)**: Number of times a term appears in a document divided by the total number of terms in the document.
 - **Inverse Document Frequency (IDF)**: Logarithm of the total number of documents divided by the number of documents containing the term.
 #### Advantages:
- - Reduces the weight of common words.
- - Highlights unique and important words.
 #### Example:
 For the corpus:
 - Doc1: "NLP is amazing."
 - Doc2: "NLP is fun and amazing."
 TF-IDF highlights words like "fun" over words like "is" that occur in every document.
 #### Applications:
- - Search engines, information retrieval, and document classification.
 """
 )
 elif page == "Word2Vec":
@@ -109,17 +133,40 @@ def show_page(page):
 st.markdown(
 """
 ### Word2Vec
- Word2Vec is a neural network-based technique for creating dense vector representations of words, capturing their semantic relationships.
- #### Key Concepts:
- - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
- - **Skip-gram**: Predicts the context from the target word.
 #### Advantages:
- - Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
- - Efficient for large datasets.
 #### Applications:
- - Text classification, sentiment analysis, and recommendation systems.
 #### Limitations:
- - Requires significant computational resources.
 """
 )
 elif page == "FastText":
@@ -127,17 +174,22 @@ def show_page(page):
 st.markdown(
 """
 ### FastText
- FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 #### Advantages:
- - Handles rare and out-of-vocabulary words.
- - Captures subword information (e.g., prefixes and suffixes).
 #### Example:
- The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing".
 #### Applications:
- - Multilingual text processing.
- - Handling noisy and incomplete data.
 #### Limitations:
- - Higher computational cost compared to Word2Vec.
 """
 )
 elif page == "Tokenization":
@@ -145,19 +197,28 @@ def show_page(page):
 st.markdown(
 """
 ### Tokenization
- Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
 #### Types of Tokenization:
 - **Word Tokenization**: Splits text into words.
 - **Sentence Tokenization**: Splits text into sentences.
 #### Libraries for Tokenization:
 - NLTK, SpaCy, and Hugging Face Transformers.
 #### Example:
- Sentence: "NLP is exciting."
- - Word Tokens: ["NLP", "is", "exciting", "."]
 #### Applications:
- - Preprocessing for machine learning models.
- #### Challenges:
- - Handling complex text like abbreviations and multilingual data.
 """
 )
 elif page == "Stop Words":
@@ -165,16 +226,30 @@ def show_page(page):
 st.markdown(
 """
 ### Stop Words
- Stop words are commonly used words in a language that are often removed during text preprocessing.
 #### Examples of Stop Words:
- - English: "is", "the", "and", "in".
- - Spanish: "es", "el", "y", "en".
 #### Why Remove Stop Words?
- - To reduce noise in text data.
 #### Applications:
- - Sentiment analysis, text classification, and search engines.
 #### Challenges:
- - Some stop words might carry context-specific importance.
 """
 )

@@ -182,8 +257,8 @@ def show_page(page):
 st.sidebar.title("NLP Topics")
 menu_options = [
 "Home",
- "NLP Terminologies",
- "One-Hot Vectorization",
 "Bag of Words",
 "TF-IDF Vectorizer",
 "Word2Vec",
 st.markdown(
 """
 ### Welcome to NLP Guide
+ Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language. The objective is to program computers to process and analyze large amounts of natural language data.
 """
 )

 # Function to display specific topic pages
 def show_page(page):
+ if page == "Text preprocessing":
+ st.title("Text preprocessing")
 st.markdown(
 """
+ ### Text preprocessing
+ Text preprocessing converts raw text into a format that models can understand and process. It turns language into numbers while preserving the meaning and context of the original text. Preprocessing is done in multiple steps, and the number of steps can vary depending on the nature of the text and the goals of the NLP task.
+ - **Tokenization**: Breaks text down into smaller units called tokens, which can be words, characters, or punctuation marks. For example, the sentence "I want to learn NLP." would be tokenized into: ["I", "want", "to", "learn", "NLP", "."].
+ - **Stop Words**: Words that carry little meaning on their own, such as "is", "the", and "and". Removing them makes it easier to focus on meaningful words.
+ - **Stemming**: Strips away suffixes to reduce words to a base form. For example, "going" is reduced to "go".
+ - **Lemmatization**: Reduces words to dictionary base forms (lemmas) that are always meaningful words. It is slower than stemming and uses a more complex algorithm.
 - **Corpus**: A large collection of text used for NLP training and analysis.
 - **Vocabulary**: The set of all unique words in a corpus.
 - **n-grams**: Continuous sequences of n words/characters from text.

 - **Parsing**: Analyzing grammatical structure of text.
 """
 )
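These preprocessing steps can be reproduced with NLTK, one common choice of library. A minimal sketch, assuming NLTK is installed and its punkt, stopwords, and wordnet resources are downloadable:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer, stop-word, and lemma resources.
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

text = "I want to learn NLP."
tokens = word_tokenize(text)  # ['I', 'want', 'to', 'learn', 'NLP', '.']

# Drop stop words and punctuation, keeping only content-bearing words.
stop = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t.lower() not in stop]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in content])                   # crude stems
print([lemmatizer.lemmatize(t, pos="v") for t in content])  # dictionary lemmas
```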
+ elif page == "Vectorization":
+ st.title("Vectorization")
 st.markdown(
 """
+ ### Vectorization
+ Vectorization in NLP is the process of converting text into numbers so that a computer can understand and analyze it. Since machines cannot read words the way humans do, text must be transformed into a format they can process: numerical vectors.
+ **One-Hot Vectorization**:
+ One-hot vectorization represents each word as a unique binary vector in which exactly one position is 1 and all other positions are 0.
 #### Example:
+ Vocabulary: ["apple", "banana", "orange"]
+ - "apple" -> [1, 0, 0]
+ - "banana" -> [0, 1, 0]
+ - "orange" -> [0, 0, 1]
 #### Advantages:
+ - Simple and easy to implement
+ - Works well for small vocabularies
 #### Limitations:
+ - High Dimensionality (Memory Usage)
+ - If the vocabulary is large (e.g., 100,000 words), each word gets a 100,000-dimensional vector.
+ - This leads to high memory usage and computational inefficiency.
+ - No Semantic Meaning (Context Ignored)
+ - One-hot vectors do not capture relationships between words.
+ - Example: "apple" and "fruit" should be similar, but their vectors are completely different.
+ - Sparse Representation
+ - Most of the values in one-hot vectors are 0s, making them sparse.
+ - Sparse matrices are inefficient to store and process.
+ - Fixed Vocabulary Size
+ - The vocabulary must be predefined.
+ - If a new word appears, the entire vectorization process must be redone.
+ #### Applications of One-Hot Vectorization:
+ - Text Classification
+ - Used in spam detection, sentiment analysis, and document categorization.
+ - Converts words into numerical form before applying machine learning models.
+ - Keyword Matching
+ - Helps in simple search and information retrieval by matching one-hot encoded words.
 """
 )
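One-hot vectors for the example vocabulary can be built in a few lines of plain Python. A minimal sketch (in practice a library encoder such as scikit-learn's OneHotEncoder would be used):

```python
vocab = ["apple", "banana", "orange"]
index = {word: i for i, word in enumerate(vocab)}  # the vocabulary must be predefined

def one_hot(word):
    vec = [0] * len(vocab)  # all zeros ...
    vec[index[word]] = 1    # ... except a 1 at the word's index (KeyError if out of vocabulary)
    return vec

print(one_hot("apple"))   # [1, 0, 0]
print(one_hot("banana"))  # [0, 1, 0]
print(one_hot("orange"))  # [0, 0, 1]
```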
 elif page == "Bag of Words":
+ st.title("Bag of Words (BoW)")
 st.markdown(
 """
 ### Bag of Words (BoW)
+ The Bag of Words (BoW) model is a simple way to represent text as numerical features. It ignores word order and focuses only on the frequency of words in a document.
+
 #### How It Works:
 1. Create a vocabulary of all unique words in the text.
 2. Count the frequency of each word in a document.
 #### Example:
 Given two sentences:
+ - "I love NLP and Machine Learning."
+ - "Machine Learning is fun and exciting."
+ Vocabulary (after removing the stop words "and" and "is"): ["I", "love", "NLP", "Machine", "Learning", "fun", "exciting"]
+ - Sentence 1: [1, 1, 1, 1, 1, 0, 0]
+ - Sentence 2: [0, 0, 0, 1, 1, 1, 1]
 #### Advantages:
+ - Simple and Easy to Implement – Works well for basic text processing tasks.
+ - Effective for Small Datasets – Good for applications with a limited vocabulary.
+ - Works with Traditional Machine Learning Models – Can be used with models like Naïve Bayes and SVM.
 #### Limitations:
+ - Ignores Word Order – "I love NLP" and "NLP love I" have the same representation.
+ - High Dimensionality – Large vocabularies lead to big feature matrices (sparse representation).
+ - Does Not Capture Meaning – Words with similar meanings ("happy" vs. "joyful") are treated separately.
 #### Applications:
+ - Text Classification – Spam detection, sentiment analysis.
+ - Information Retrieval – Search engines rank documents based on word frequency.
+ - Topic Modeling – Identifying common themes in documents.
+ - Document Similarity – Comparing text based on shared words.
 """
 )
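The same counts can be produced with scikit-learn's CountVectorizer, sketched below. Note that by default it lowercases text, drops one-letter tokens such as "I", and sorts the vocabulary alphabetically, so the matrix is arranged slightly differently from the hand-worked example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love NLP and Machine Learning.",
    "Machine Learning is fun and exciting.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # one row of word counts per sentence
```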
 elif page == "TF-IDF Vectorizer":

 st.markdown(
 """
 ### TF-IDF Vectorizer
+ TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical representation of text that evaluates the importance of a word in a document relative to a collection of documents (corpus). It adjusts the raw word frequency based on how common or rare a word is across the corpus.
 #### Formula:
 \[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
 - **Term Frequency (TF)**: Number of times a term appears in a document divided by the total number of terms in the document.
 - **Inverse Document Frequency (IDF)**: Logarithm of the total number of documents divided by the number of documents containing the term.
 #### Advantages:
+ - Handles Common and Rare Words – Weights up words that are frequent in one document but rare across the corpus.
+ - Improves Relevance – Prioritizes important words by adjusting for their frequency in the whole corpus.
+ - Works Well with Search Engines – Helps rank documents based on relevant keywords.
 #### Example:
 For the corpus:
 - Doc1: "NLP is amazing."
 - Doc2: "NLP is fun and amazing."
 TF-IDF highlights words like "fun" over words like "is" that occur in every document.
 #### Limitations:
+ - Ignores Word Context – Does not consider the word's meaning or context within the document (similar to Bag of Words).
+ - Requires Preprocessing – Needs cleaning and stop-word removal for best results.
+ - Limited to the Bag of Words Model – Does not capture word order or relationships (unlike Word2Vec or BERT).
 #### Applications:
+ - Text Classification – Categorizing documents into topics based on their content.
+ - Information Retrieval – Ranking search engine results based on the relevance of words in the query and documents.
+ - Document Clustering – Grouping similar documents based on shared important terms.
+ - Keyword Extraction – Identifying the most important keywords in a document.
+
 """
 )
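A sketch with scikit-learn's TfidfVectorizer, which implements a smoothed, normalized variant of the formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["NLP is amazing.", "NLP is fun and amazing."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # rows are documents, columns are terms

# Terms unique to Doc2 ("fun", "and") receive the highest weights in its row;
# terms present in every document ("nlp", "is", "amazing") are weighted down.
for term, weight in zip(vectorizer.get_feature_names_out(), X.toarray()[1]):
    print(f"{term}: {weight:.3f}")
```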
 elif page == "Word2Vec":

 st.markdown(
 """
 ### Word2Vec
+ Word2Vec is a word embedding technique that represents words in a continuous vector space, where semantically similar words are represented by vectors that lie close together. Word2Vec captures the relationships between words based on their context in a large corpus of text.
+ #### How Word2Vec Works:
+ - **Continuous Bag of Words (CBOW)**
+ - Predicts the target word based on the context words.
+ - Example: Given the context words "I", "love", "coding", the model predicts the target word "NLP".
+ - **Skip-gram**
+ - Predicts the context words given a target word.
+ - Example: Given the target word "NLP", the model predicts the context words "I", "love", "coding".
+ #### Example of Word2Vec
+ - Sentence 1: "I love programming"
+ - Sentence 2: "Programming is fun"
+ - Step 1: Create Context-Target Pairs (Skip-gram)
+ - For Sentence 1:
+ - Target Word: "love"
+ - Context Words: ["I", "programming"]
+ - For Sentence 2:
+ - Target Word: "programming"
+ - Context Words: ["is", "fun"]
+ - Step 2: Train the Word2Vec Model
+ - The model learns embeddings such that words that frequently appear in similar contexts (like "love" and "programming") have similar vector representations.
 #### Advantages:
+ - Captures Semantic Meaning – Words with similar meanings or contexts are closer in the vector space.
+ - Reduces Dimensionality – Converts high-dimensional one-hot vectors into lower-dimensional, dense vectors.
+ - Generalizable – Can be applied across multiple languages and domains.
 #### Applications:
+ - Semantic Search – Improving search engines by ranking results based on word similarity.
+ - Text Classification – Representing text as vectors for machine learning models.
+ - Word Analogy – Solving word analogy problems like "king - man + woman ≈ queen".
+ - Recommendation Systems – Suggesting content by understanding relationships between items (e.g., movies, products).
 #### Limitations:
+ - Contextual Ambiguity – Doesn't handle words with multiple meanings (e.g., "bank" as a financial institution vs. the "bank" of a river).
+ - Requires a Large Corpus – Needs a large text corpus to learn meaningful embeddings.
+ - Fixed Representations – Each word has a single vector, which cannot reflect different meanings in different contexts.
 """
 )
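A toy training run with gensim's Word2Vec implementation. This is a sketch of the API only; meaningful embeddings require a far larger corpus than two sentences:

```python
from gensim.models import Word2Vec

# Pre-tokenized, lowercased versions of the example sentences.
sentences = [
    ["i", "love", "programming"],
    ["programming", "is", "fun"],
]

# sg=1 selects the skip-gram objective; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["programming"].shape)         # (50,) dense vector
print(model.wv.most_similar("programming"))  # nearest words in the vector space
```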
 elif page == "FastText":

 st.markdown(
 """
 ### FastText
+ FastText is an extension of the Word2Vec model, developed by Facebook's AI Research lab (FAIR). While Word2Vec represents each word as a single vector, FastText represents each word as a bag of character n-grams. This lets FastText generate better word representations, especially for rare or out-of-vocabulary words.
 #### Advantages:
+ - Better for Rare Words – FastText can generate meaningful embeddings for rare and out-of-vocabulary words because it uses subword information.
+ - Handles Morphological Variations – It captures word variants better (e.g., "run", "running", and "runner" receive similar vectors).
+ - Handles Subword Relationships – Since it looks at character n-grams, FastText captures similarities between words based on their internal structure.
 #### Example:
+ The word "apple" might be represented by n-grams like "app", "ppl", "ple".
 #### Applications:
+ - Handling Out-of-Vocabulary Words – Useful for applications like machine translation or speech recognition where new words might appear.
+ - Text Classification – Efficient at representing text for downstream tasks like sentiment analysis and spam detection.
+ - Named Entity Recognition (NER) – FastText can better identify and classify entities, even if they are rare or domain-specific.
+ - Language Modeling – Helps build more robust language models for text generation or speech-to-text applications.
 #### Limitations:
+ - Larger Model Size – Since it stores vectors for n-grams in addition to words, the model can be larger than Word2Vec for the same vocabulary.
+ - Slower Training – Training on large datasets can be slower due to the additional subword computation.
+ - No Contextualized Representation – Like Word2Vec, FastText does not provide context-sensitive embeddings (a word with different meanings in different contexts gets a single representation).
 """
 )
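A sketch with gensim's FastText implementation. Because vectors are composed from character n-grams, even a word absent from the training sentences still gets an embedding:

```python
from gensim.models import FastText

sentences = [
    ["i", "love", "programming"],
    ["programming", "is", "fun"],
]

# min_n/max_n control the character n-gram sizes used for subwords.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print(model.wv["programming"][:5])  # word seen during training
print(model.wv["programmer"][:5])   # out-of-vocabulary word, built from shared n-grams
```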
 elif page == "Tokenization":

 st.markdown(
 """
 ### Tokenization
+ Tokenization is the process of breaking down a text (like a sentence or document) into smaller, meaningful units called tokens.
 #### Types of Tokenization:
 - **Word Tokenization**: Splits text into words.
 - **Sentence Tokenization**: Splits text into sentences.
 #### Libraries for Tokenization:
 - NLTK, SpaCy, and Hugging Face Transformers.
 #### Example:
+ Sentence: "Tokenization is fun!"
+ - Word Tokens: ["Tokenization", "is", "fun", "!"]
+ #### Advantages:
+ - Essential for Text Processing – Converts raw text into manageable pieces for further analysis.
+ - Enables NLP Models – Lets models work with text data, whether for classification, translation, or generation.
+ - Flexible for Various Tasks – Can operate at the word, subword, or character level depending on the task.
 #### Applications:
+ - Text Preprocessing – Tokenization is typically performed before other NLP tasks such as text classification, named entity recognition (NER), and sentiment analysis.
+ - Machine Translation – Helps translation systems break sentences down into manageable parts.
+ - Speech Recognition – When spoken language is converted into written text, tokenization breaks phrases down into individual words.
+ - Text Summarization – Breaks a long document into smaller units for summarization.
+ #### Limitations:
+ - Ambiguity with Contractions and Punctuation – Deciding how to split forms like "I'm" (versus "I am") can be tricky.
+ - Handling Compound Words – Some compound words may not be split in a way that is helpful for certain tasks.
+ - Language-Specific Issues – Tokenization rules vary across languages. For example, Chinese has no spaces between words, making tokenization more complex.
 """
 )
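Word and sentence tokenization with NLTK, one of the libraries listed above (a sketch; assumes the punkt resource is available):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

text = "Tokenization is fun! It splits text into tokens."
print(word_tokenize(text))  # ['Tokenization', 'is', 'fun', '!', 'It', 'splits', ...]
print(sent_tokenize(text))  # ['Tokenization is fun!', 'It splits text into tokens.']
```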
 elif page == "Stop Words":

 st.markdown(
 """
 ### Stop Words
+ Stop words are common words (such as "and", "the", "is", "in", "of") that are typically removed from text during preprocessing in natural language processing (NLP) tasks. These words often carry little meaning on their own and can add noise when analyzing text.
 #### Examples of Stop Words:
+ - Articles: "a", "an", "the"
+ - Prepositions: "in", "on", "at", "by", "with"
+ - Pronouns: "he", "she", "it", "they"
+ - Conjunctions: "and", "but", "or", "yet"
+ - Auxiliary Verbs: "is", "are", "was", "were"
 #### Why Remove Stop Words?
+ - No Meaningful Contribution: Words like "a", "an", "the", and "is" provide little information and can clutter text data.
+ - Reduce Dimensionality: Removing stop words shrinks the vocabulary and makes analysis more efficient.
+ - Improve Model Performance: With low-information words removed, models can focus on the more informative words.
+ #### Advantages:
+ - Reduces Noise: Keeps unnecessary words from affecting the analysis.
+ - Speeds Up Processing: Decreases the number of words to process, improving efficiency.
+ - Improves Accuracy: Helps algorithms focus on more meaningful words.
 #### Applications:
+ - Text Preprocessing – Stop words are often removed in the early stages of text analysis to clean the data.
+ - Information Retrieval – Helps improve search results by focusing on more meaningful keywords.
+ - Text Classification – For classification tasks (e.g., spam detection), removing stop words can improve the model's ability to classify based on relevant terms.
+ - Sentiment Analysis – Stop-word removal can enhance sentiment detection by focusing on impactful words.
 #### Challenges:
+ - Context Loss: Stop words can carry important context, and removing them may change the meaning of a sentence. Example: in "He is going to the store", removing "is" or "to" could cause confusion.
+ - Language-Specific: What counts as a stop word varies by language. A word like "is" is common in English but may not be a stop word in another language.
+
 """
 )
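Stop-word removal with NLTK's built-in English list, as a sketch. The content words survive while "he", "is", "to", "the", and "and" are filtered out:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("He is going to the store and the market.")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['going', 'store', 'market', '.']
```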

 st.sidebar.title("NLP Topics")
 menu_options = [
 "Home",
+ "Text preprocessing",
+ "Vectorization",
 "Bag of Words",
 "TF-IDF Vectorizer",
 "Word2Vec",