---
datasets:
- BEE-spoke-data/bees-internal
language:
- en
license: apache-2.0
---

# BeeTokenizer

> note: this is **literally** a tokenizer trained on beekeeping text

After minutes of hard work, it is now available.


```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

output = tokenizer(test_string)
print(f"Test string: {test_string}")
print(f"Tokens ({len(output.input_ids)}):\n\t{output.input_ids}")
```
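
You can also sanity-check the round trip by decoding the ids back into text. A minimal sketch (exact whitespace handling depends on your `transformers` version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

# encode, then decode back; useful as a quick sanity check of the vocab
ids = tokenizer("Varroa destructor is a parasitic mite of the honey bee.").input_ids
print(tokenizer.decode(ids))
```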


## Notes

1. The default tokenizer (on branch `main`) has a vocab size of 32000.
2. It is based on the `SentencePieceBPETokenizer` class.
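
As a rough illustration of point 2, a tokenizer of this kind can be trained with the `tokenizers` library. The sketch below is illustrative only; the corpus, special tokens, and training settings shown are assumptions, not the actual BeeTokenizer recipe.

```python
from tokenizers import SentencePieceBPETokenizer

# stand-in corpus; the real training data (BEE-spoke-data/bees-internal)
# is not reproduced here
corpus = [
    "Varroa destructor is a parasitic mite of the honey bee.",
    "Requeening a colony in early spring can reduce swarming.",
]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=32000,  # matches the vocab size noted above
    min_frequency=2,
    special_tokens=["<unk>", "<s>", "</s>"],  # assumed, not confirmed
)

tokenizer.save("bee-tokenizer.json")
```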

<details>
  <summary>How to Tokenize Text and Retrieve Offsets</summary>
  
  To tokenize a complex sentence and also retrieve the offsets mapping, you can use the following Python code snippet:

  ```python
  from transformers import AutoTokenizer

  # Initialize the tokenizer
  tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

  # Sample complex sentence related to beekeeping
  test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

  # Tokenize the input string and get the offsets mapping
  output = tokenizer.encode_plus(test_string, return_offsets_mapping=True)

  print(f"Test string: {test_string}")

  # Tokens
  tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
  print(f"Tokens: {tokens}")

  # Offsets
  offsets = output['offset_mapping']
  print(f"Offsets: {offsets}")
  ```

  This should result in the following (_Feb '24 version_):
  
  ```
  >>> print(f"Test string: {test_string}")
  Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
  >>>
  >>> # Tokens
  >>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
  >>> print(f"Tokens: {tokens}")
  Tokens: ['When', '▁dealing', '▁with', '▁Varroa', '▁destructor', '▁mites,', "▁it's", '▁cru', 'cial', '▁to', '▁administer', '▁the', '▁right', '▁acar', 'icides', '▁during', '▁the', '▁late', '▁autumn', '▁months,', '▁but', '▁only', '▁after', '▁ensuring', '▁that', '▁the', '▁worker', '▁bee', '▁population', '▁is', '▁free', '▁from', '▁pesticide', '▁contam', 'ination.']
  >>>
  >>> # Offsets
  >>> offsets = output['offset_mapping']
  >>> print(f"Offsets: {offsets}")
  Offsets: [(0, 4), (4, 12), (12, 17), (17, 24), (24, 35), (35, 42), (42, 47), (47, 51), (51, 55), (55, 58), (58, 69), (69, 73), (73, 79), (79, 84), (84, 90), (90, 97), (97, 101), (101, 106), (106, 113), (113, 121), (121, 125), (125, 130), (130, 136), (136, 145), (145, 150), (150, 154), (154, 161), (161, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 199), (199, 206), (206, 214)]
  ```

  If you compare this to the output of [the llama tokenizer](https://huggingface.co/fxmarty/tiny-llama-fast-tokenizer) (below), you can quickly see which is more suited for beekeeping-related language modeling.

  ```
  >>> print(f"Test string: {test_string}")
  Test string: When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination.
  >>> # Tokens
  >>> tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
  >>> print(f"Tokens: {toke>>> print(f"Tokens: {tokens}")
  Tokens: ['<s>', '▁When', '▁dealing', '▁with', '▁Var', 'ro', 'a', '▁destruct', 'or', '▁mit', 'es', ',', '▁it', "'", 's', '▁cru', 'cial', '▁to', '▁admin', 'ister', '▁the', '▁right', '▁ac', 'ar', 'ic', 'ides', '▁during', '▁the', '▁late', '▁aut', 'umn', '▁months', ',', '▁but', '▁only', '▁after', '▁ens', 'uring', '▁that', '▁the', '▁worker', '▁be', 'e', '▁population', '▁is', '▁free', '▁from', '▁p', 'estic', 'ide', '▁cont', 'am', 'ination', '.']
  >>> offsets = output['offset_mapping']
  >>> print(f"Offsets: {offsets}")
  Offsets: [(0, 0), (0, 4), (4, 12), (12, 17), (17, 21), (21, 23), (23, 24), (24, 33), (33, 35), (35, 39), (39, 41), (41, 42), (42, 45), (45, 46), (46, 47), (47, 51), (51, 55), (55, 58), (58, 64), (64, 69), (69, 73), (73, 79), (79, 82), (82, 84), (84, 86), (86, 90), (90, 97), (97, 101), (101, 106), (106, 110), (110, 113), (113, 120), (120, 121), (121, 125), (125, 130), (130, 136), (136, 140), (140, 145), (145, 150), (150, 154), (154, 161), (161, 164), (164, 165), (165, 176), (176, 179), (179, 184), (184, 189), (189, 191), (191, 196), (196, 199), (199, 204), (204, 206), (206, 213), (213, 214)]
  ```

</details>
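
To make that comparison concrete, you can count the tokens each tokenizer produces for the same string. A minimal sketch, assuming both repos are reachable from your environment:

```python
from transformers import AutoTokenizer

test_string = (
    "When dealing with Varroa destructor mites, it's crucial to administer "
    "the right acaricides during the late autumn months, but only after ensuring "
    "that the worker bee population is free from pesticide contamination."
)

# compare token counts for the same text
for repo in ("BEE-spoke-data/BeeTokenizer", "fxmarty/tiny-llama-fast-tokenizer"):
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(test_string).input_ids)
    print(f"{repo}: {n} tokens")
```

Fewer tokens for the same text generally indicates a vocabulary that is a better fit for the domain.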