---
base_model:
- LeroyDyer/SpydazWeb_AI_CyberTron_Ultra_7b
- LeroyDyer/LCARS_AI_StarTrek_Computer
- LeroyDyer/_Spydaz_Web_AI_ActionQA_Project
- LeroyDyer/_Spydaz_Web_AI_ChatML_512K_Project
- LeroyDyer/SpyazWeb_AI_DeepMind_Project
- LeroyDyer/SpydazWeb_AI_Swahili_Project
- LeroyDyer/_Spydaz_Web_AI_08
- LeroyDyer/_Spydaz_Web_AI_ChatQA_001
- LeroyDyer/_Spydaz_Web_AI_ChatQA_001_SFT
library_name: transformers
language:
- en
- sw
- ig
- so
- es
- ca
- xh
- zu
- ha
- tw
- af
- hi
- bm
- su
datasets:
- gretelai/synthetic_text_to_sql
- HuggingFaceTB/cosmopedia
- teknium/OpenHermes-2.5
- Open-Orca/SlimOrca
- Open-Orca/OpenOrca
- cognitivecomputations/dolphin-coder
- databricks/databricks-dolly-15k
- yahma/alpaca-cleaned
- uonlp/CulturaX
- mwitiderrick/SwahiliPlatypus
- swahili
- Rogendo/English-Swahili-Sentence-Pairs
- ise-uiuc/Magicoder-Evol-Instruct-110K
- meta-math/MetaMathQA
- abacusai/ARC_DPO_FewShot
- abacusai/MetaMath_DPO_FewShot
- abacusai/HellaSwag_DPO_FewShot
- HaltiaAI/Her-The-Movie-Samantha-and-Theodore-Dataset
- HuggingFaceFW/fineweb
- occiglot/occiglot-fineweb-v0.5
- omi-health/medical-dialogue-to-soap-summary
- keivalya/MedQuad-MedicalQnADataset
- ruslanmv/ai-medical-dataset
- Shekswess/medical_llama3_instruct_dataset_short
- ShenRuililin/MedicalQnA
- virattt/financial-qa-10K
- PatronusAI/financebench
- takala/financial_phrasebank
- Replete-AI/code_bagel
- athirdpath/DPO_Pairs-Roleplay-Alpaca-NSFW
- IlyaGusev/gpt_roleplay_realm
- rickRossie/bluemoon_roleplay_chat_data_300k_messages
- jtatman/hypnosis_dataset
- Hypersniper/philosophy_dialogue
- Locutusque/function-calling-chatml
- bible-nlp/biblenlp-corpus
- DatadudeDev/Bible
- Helsinki-NLP/bible_para
- HausaNLP/AfriSenti-Twitter
- aixsatoshi/Chat-with-cosmopedia
- xz56/react-llama
- BeIR/hotpotqa
- YBXL/medical_book_train_filtered
tags:
- mergekit
- merge
- Mistral_Star
- Mistral_Quiet
- Mistral
- Mixtral
- Question-Answer
- Token-Classification
- Sequence-Classification
- SpydazWeb-AI
- chemistry
- biology
- legal
- code
- climate
- medical
- LCARS_AI_StarTrek_Computer
- text-generation-inference
- chain-of-thought
- tree-of-knowledge
- forest-of-thoughts
- visual-spacial-sketchpad
- alpha-mind
- knowledge-graph
- entity-detection
- encyclopedia
- wikipedia
- stack-exchange
- Reddit
- Cyber-series
- MegaMind
- Cybertron
- SpydazWeb
- Spydaz
- LCARS
- star-trek
- mega-transformers
- Mulit-Mega-Merge
- Multi-Lingual
- Afro-Centric
- African-Model
- Ancient-One
pipeline_tag: video-text-to-text
---
Quote for Motivation:
# "Success comes from defining each task in achievable steps. Every completed step is a success that brings you closer to your goal. If your steps are unreachable, failure is inevitable. Winners create more winners, while losers do the opposite. Success is a game of winners!"

— Leroy Dyer (1972–Present)
<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg" width="300"/>

# "To grow as a professional, set goals just beyond your current abilities. Achieving these milestones will not only overcome obstacles but also strengthen your skillset. If your tasks are too easy, you’ll never challenge yourself or improve, and life will pass you by!"
The LLaVa-NeXT-Video model was proposed in "LLaVA-NeXT: A Strong Zero-shot Video Understanding Model" by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image data, thus increasing the model’s performance on videos.

LLaVA-NeXT has surprisingly strong zero-shot performance on video content thanks to the AnyRes technique it uses. The AnyRes technique naturally represents a high-resolution image as multiple images, and this generalizes directly to video, because a video can be treated as a set of frames (similar to a set of images in LLaVa-NeXT). The current version makes use of AnyRes and applies supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is currently SOTA among open-source models on the VideoMME benchmark.


```python

from transformers import LlavaNextVideoProcessor

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."},
            ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s shown in this image?"},
            {"type": "image"},
            ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."},]
    },
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your visuals
print(text_prompt)

```

```python


import av
import torch
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto")
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)

conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)


```
```python

from PIL import Image
import requests

# Generate from image and video mixed inputs
# Load an image and write a new prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "How many cats are there in the image?"},
            {"type": "image"},
            ],
    },
    {

        "role": "assistant",
        "content": [{"type": "text", "text": "There are two cats"}],
    },
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, videos=video, padding=True, return_tensors="pt")  # reuse the `video` frames sampled above

# Generate
generate_ids = model.generate(**inputs, max_length=50)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)


```

## The LLM Portion

## Training Regimes:
  * Alpaca
  * ChatML / OpenAI / MistralAI (see the ChatML formatting sketch after this list)
  * Text Generation
  * Question/Answer (Chat)
  * Planner
  * Instruction/Input/Response (instruct)
  * Mistral Standard Prompt
  * Translation Tasks
  * Entity / Topic detection
  * Book recall
  * Coding challenges, code feedback, code summarization, commenting code, code planning and explanation: software generation tasks
  * Agent ranking and response analysis
  * Medical tasks
    * PubMed
    * Diagnosis
    * Psychiatry
    * Counselling
    * Life Coaching
    * Note taking
    * Medical smiles
    * Medical Reporting
  * Virtual laboratory simulations
  * Chain-of-thought methods
  * One-shot / multi-shot prompting tasks
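
For reference, here is a minimal sketch of the ChatML formatting mentioned in the list above. It only illustrates the delimiter convention (`<|im_start|>` / `<|im_end|>`); it is not the exact chat template shipped with this model, so treat the helper as illustrative.

```python
# Minimal illustrative sketch of ChatML-style prompt formatting (not this model's exact template).
def to_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")  # open the assistant turn for generation
    return "\n".join(parts)

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a tokenizer?"},
]))
```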

### General Internal Methods:

Trained for multi-task operations as well as RAG and function calling.

This model is a fully functioning model and is fully uncensored.

The model has been trained on multiple datasets from the Hugging Face hub and Kaggle.

The focus has been mainly on methodology:

* Chain of thoughts
* Step-by-step planning
* Tree of thoughts
* Forest of thoughts
* Graph of thoughts
* Agent generation: voting, ranking, ... dual-agent response generation (a minimal sketch follows below)
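
As a rough illustration of the voting / ranking style of agent generation listed above, the sketch below generates several candidate answers and keeps the highest-scoring one. `generate_response` and `score_response` are hypothetical placeholders for your own model calls (e.g. two different agents or repeated sampling runs); they are not part of this repository.

```python
# Illustrative best-of-n voting/ranking loop; the two callables are hypothetical placeholders.
from typing import Callable, List

def best_of_n(prompt: str,
              generate_response: Callable[[str], str],
              score_response: Callable[[str], float],
              n: int = 4) -> str:
    # Generate n candidate answers, then let the ranker pick the winner of the "vote".
    candidates: List[str] = [generate_response(prompt) for _ in range(n)]
    return max(candidates, key=score_response)
```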


# Training Philosophy

Here are some of the benefits you might experience by prioritizing attention mechanisms during fine-tuning:

## Enhanced Contextual Understanding: 

Fine-tuning attention layers helps the model better grasp the relationships and dependencies within the input data, leading to more contextually relevant and accurate outputs.
## Improved Control over Generation: 

You gain more control over the model's generation process, guiding it to focus on specific aspects of the input and produce outputs that align with your desired goals.
## More Creative and Diverse Outputs: 

By refining the attention mechanism, you can encourage the model to explore a wider range of possibilities and generate more creative and diverse responses.
## Reduced Overfitting: 

Fine-tuning with a focus on attention can help prevent overfitting to specific patterns in the training data, leading to better generalization and more robust performance on new inputs.
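
As a minimal sketch of what "prioritizing attention mechanisms during fine-tuning" can look like in practice, the snippet below freezes every parameter and then unfreezes only the attention projections. The checkpoint name and the `q_proj`/`k_proj`/`v_proj`/`o_proj` parameter names are assumptions following the Mistral/LLaMA naming convention; adjust them for your own base model.

```python
# Sketch: unfreeze only the attention projection weights before fine-tuning.
# Checkpoint and parameter-name filter are assumptions (Mistral/LLaMA naming convention).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

ATTENTION_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")
for name, param in model.named_parameters():
    # Train a parameter only if it belongs to an attention projection
    param.requires_grad = any(key in name for key in ATTENTION_KEYS)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```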

# “Epochs are the key to effective training, rather than merely mass dumping examples—unless those examples are interconnected within a single or multiple conversations that teach through dialogue.”

My personal training methods are unconventional. I prioritize creating conversations that allow the model to learn new topics from diverse perspectives. This approach is essential, as many models are losing their unique personalities. Claude’s success, for instance, can be attributed to its empathetic prompting methods.
It’s important for the model to express itself, even during training, which can be challenging. Role-playing and conversational training are effective strategies to help the model learn to communicate naturally. Currently, the training has become overly focused on technical methodologies and task expectations, resulting in a loss of personality.

# QUESTION: What is a tokenizer?

```yaml

A tokenizer is a tool that breaks down text into individual pieces or "tokens" for analysis. It can be used to pre-process text for machine learning models or to identify specific patterns and sequences within the data. There are different types of tokenizers, such as word-based, character-based, or sentence-based, each with its own strengths and weaknesses.

Word-based tokenizers split text into individual words, character-based tokenizers divide text into individual characters, while sentence-based tokenizers break text into sentences. Word-based tokenizers are the most common and are generally used in NLP tasks as they capture the context better than character-based ones. Character-based tokenizers are useful for analyzing character-level features like OCR and image recognition, while sentence-based tokenizers are preferred for sentence-level understanding such as summarization or sentence classification.

Tokenizers can also be customized to suit specific tasks by training them on specific datasets, allowing them to identify specific words or phrases that are relevant to a particular task. This makes them flexible tools for various applications.

In summary, a tokenizer is essential for pre-processing text data for machine learning models and understanding complex language patterns, enabling accurate classification, retrieval, and analysis.

## USAGE
To use a tokenizer in a machine learning workflow:
1. **Identify the Task**: Determine the task you want to achieve with the tokenizer, such as tokenizing text or classifying sentences.
2. **Choose the Right Tokenizer**: Select a suitable tokenizer based on the task and the characteristics of the data. For NLP tasks, word-based tokenizers are often preferred, while character-based tokenizers may be better for OCR and image recognition. Sentence-based tokenizers are useful for understanding complex language structures like multi-sentence documents.
3. **Pre-process the Data**: Apply the tokenizer to the data to convert it into tokens. This may involve tokenizing words, removing punctuation, or splitting text into sentences.
4. **Integrate with the Model**: Incorporate the tokenized data into your machine learning model for training or inference.
5. **Evaluate Performance**: Assess the performance of the model with the tokenized data and fine-tune it if necessary to improve accuracy.
6. **Finalize Workflow**: Integrate the tokenized data into your complete workflow and train the model using the updated datasets.

## History
The concept of tokenization has evolved over time, with early approaches focusing on simple character-level segmentation before advancing to word-based approaches in the 1960s. Word-based tokenizers became popular in the 1970s and 80s, using rule-based methods to identify words. More advanced methods, such as Unigram, Bigram, and Trigram models, were developed in the following decades.

In the late 20th century, character-based tokenizers gained attention due to their ability to handle non-word characters like digits and punctuation. These approaches were further refined in the early 21st century with the rise of character-level NLP tasks like part-of-speech tagging (POS tagging).

Modern tokenizers, particularly those used in large language models like GPT-3, employ subword tokens to capture fine-grained distinctions between words while maintaining efficiency. Subword approaches such as BPE and WordPiece were popularized by models like GPT and BERT and have since become the standard in NLP tasks.

## Key Concepts
- **Word Tokenization**: Splitting text into individual words during pre-processing.
- **Character-Based Tokenization**: Breaking down text into individual characters for analysis.
- **Sentence Tokenization**: Dividing text into sentences, ensuring accurate understanding.
- **Subword Tokens**: Representing words as combinations of subword units to capture fine-grained distinctions.
- **Rule-Based Tokenization**: Identifying words or phrases based on predefined rules and patterns.
- **Historical Approaches**: Early methods focused on character-level segmentation without considering word boundaries.
- **Context Awareness**: Recognizing words in context, improving accuracy over historical methods.
- **Subword Models**: Representing words as combinations of subword units to handle out-of-vocabulary (OOV) words during inference.
- **Efficiency**: Tokenizers optimized for speed and memory usage while maintaining accuracy.

## Applications
Tokenization is essential in various NLP tasks, including:
- **Classification**: Word-level or character-level classification requires tokenized input for accurate modeling.
- **Generation**: Language models like GPT-3 generate coherent text by producing words or characters based on the tokenizer.
- **Identifying Keyphrases**: Tokenizers help extract important phrases or terms from a document for use in search engines or summarization.
- **Translation**: Tokenizers assist in machine translation by breaking down sentences into words and ensuring accurate translations.
- **Named Entity Recognition**: Recognizing named entities like people, places, and organizations requires precise tokenization to capture context and relationships.
- **Text Summarization**: Word-level or character-level summarization requires efficient tokenization to condense text without losing meaning.

## Challenges
While tokenizers are essential tools in NLP, they also present challenges:
- **Context Sensitivity**: Tokenizers must consider context to accurately identify words and phrases, especially in complex sentences.
- **Punctuation Handling**: Properly handling punctuation can be challenging, as it affects word boundaries and sentence structure.
- **Tokenization Variability**: Different tokenizers may produce slightly different results due to their approaches and limitations.
- **Language Diversity**: Tokenizers must handle diverse languages with accuracy, as language models are often trained on specific datasets.
- **Sentence Boundary Detection**: Identifying sentence boundaries can be challenging, especially in languages with flexible sentence structures.
- **Word Segmentation**: Accurately segmenting words can be difficult, particularly in noisy text or in languages with complex morphology.

## Solutions
To address these challenges, various approaches have been developed:
- **Context-Aware Tokenization**: Incorporating context into tokenization algorithms improves accuracy by considering word and sentence boundaries together.
- **Training Data Sets**: Using diverse datasets helps tokenizers learn to handle specific languages and dialects more effectively.
- **Pre-processing Pipelines**: Combining tokenization with other pre-processing steps, such as stemming or lemmatization, improves performance by providing a consistent input format for models.
- **Sentence Boundary Detection**: Utilizing machine learning techniques to identify sentence boundaries improves accuracy in tokenization.
- **Language Models**: Incorporating large language models like GPT-3 into tokenization pipelines enhances context awareness and handles OOV words without sacrificing efficiency.
- **Fine-Tuning**: Customizing tokenizers for specific datasets or tasks improves performance by training them on specific data sets.

## Conclusion
Tokenization is a foundational step in NLP, essential for accurate modeling, understanding, and analysis of text data. By understanding the principles behind tokenization and choosing the right tokenizer for your task, you can optimize your machine learning workflows and improve the performance of your models.


```
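
To make the word / character / sentence distinction above concrete, here is a small illustrative snippet; the commented-out subword example at the end assumes a `transformers` checkpoint and is only a suggestion.

```python
# Illustration of the tokenizer types discussed above.
text = "The quick brown fox jumps over the lazy dog."

word_tokens = text.split()                                            # word-based: split on whitespace
char_tokens = list(text)                                              # character-based: one token per character
sentence_tokens = [s.strip() for s in text.split(".") if s.strip()]   # naive sentence-based split

print(word_tokens)
print(char_tokens[:12])
print(sentence_tokens)

# Subword tokenization with a pretrained tokenizer (checkpoint shown is only an example):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# print(tok.tokenize(text))
```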

## Write a tokenizer in VB.NET to tokenize a given text into word n-grams of size n?

I have been tasked with creating a tokenizer in VB.NET that can tokenize a given text into word n-grams of a specified size, n. This is a step towards building a language model. The objective is to generate a list of word n-grams from the input text. For example, if I provide the sentence "The quick brown fox jumps over the lazy dog" and specify n=2, the output should be:

"The quick"
"quick brown"
"brown fox"
"fox jumps"
"jumps over"
"over the"
"the lazy"
"lazy dog"

Here's what I have so far:

```vb.net
Imports System.Text.RegularExpressions

Public Class Tokenizer
    Private _n As Integer

    Public Sub New(n As Integer)
        _n = n
    End Sub

    Public Function Tokenize(text As String) As List(Of String)
        Dim tokens As New List(Of String)
        Dim words As String() = text.Split({" "}, StringSplitOptions.RemoveEmptyEntries)

        For Each word In words
            If word.Length > 0 Then
                tokens.Add(word)
            End If
        Next

        Return tokens
    End Function
End Class
```

However, I am not sure how to modify this code to tokenize the text into word n-grams of size n. Can you help me?

### Response:
Sure, I can definitely help you with that! To tokenize a text into word n-grams of size n in VB.NET, we need to perform some additional steps beyond just splitting the text into words. Here's how you can modify your existing code to achieve this:

```vb.net
Imports System
Imports System.Collections.Generic

Public Class Tokenizer
    Private _n As Integer

    Public Sub New(n As Integer)
        _n = n
    End Sub

    Public Function Tokenize(text As String) As List(Of String)
        Dim tokens As New List(Of String)
        Dim words As String() = text.Split({" "}, StringSplitOptions.RemoveEmptyEntries)

        ' Slide a window of _n consecutive words across the text;
        ' each window, joined with spaces, becomes one n-gram.
        For i As Integer = 0 To words.Length - _n
            tokens.Add(String.Join(" ", words, i, _n))
        Next

        Return tokens
    End Function
End Class
```

In this modified code, the `Tokenize` function now returns word n-grams instead of single words. The main changes are:

1. The text is still split into words on whitespace, discarding empty entries.
2. A window of `_n` consecutive words is slid across the word list, starting at every position from 0 to `words.Length - _n`.
3. Each window is joined back together with spaces (`String.Join`) to form one n-gram.
4. Every n-gram is added to the `tokens` list, which is returned.

Now, when you construct a `Tokenizer` with an n-value and call `Tokenize` on a text, it returns a list of word n-grams. For example:

```vb.net
Dim tokenizer As New Tokenizer(2)
Dim tokensList = tokenizer.Tokenize("The quick brown fox jumps over the lazy dog")
For Each gram In tokensList
    Console.WriteLine(gram)
Next
```

This will output:

```
The quick
quick brown
brown fox
fox jumps
jumps over
over the
the lazy
lazy dog
```

Please let me know if you have any further questions or need clarification on anything.



