alibayram committed
Commit 94af74f · verified · 1 Parent(s): 9c11d1b

Update README.md

Files changed (1)
  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - tr
+ ---
+
+ # TR Tokenizer
+
+ ## Tokenizer Summary
+ TR Tokenizer is an innovative FastTokenizer that splits Turkish words along their semantically meaningful units, combining modern natural language processing methods with Turkish grammar rules. This fast and efficient tokenizer produces accurate, fine-grained results by analyzing words both morphologically and semantically. For example, the sentence "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar" (academics, together with their families, are actively working) is split into the following parts:
+
+ ```python
+ ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte', 'aktif', 'çalış', 'ı', 'yor', 'lar']
+ ```
+
+ ## Supported Tasks and Applications
+ TR Tokenizer can be used for the following NLP tasks:
+ - **Morphological Analysis**: Analyzes the root and suffix structure of words.
+ - **Language Model Training and Fine-tuning**: Segments words along their semantic units during the preprocessing phase of Turkish language model training.
+ - **Frequency Analysis**: Helps determine word and morpheme frequencies in texts (see the sketch after this list).
+ - **Natural Language Processing (NLP) Research**: Supports studies of the morphological structure and word formation of Turkish.
+
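+ As a brief illustration of the frequency-analysis use case, the sketch below counts token frequencies with Python's `collections.Counter`. It is a minimal example, not part of this repository, and the sample sentences are illustrative placeholders.
+
+ ```python
+ # Minimal sketch: token frequency analysis with TR Tokenizer.
+ # The sample sentences are illustrative placeholders.
+ from collections import Counter
+
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("alibayram/tr_tokenizer", use_fast=True)
+
+ sentences = [
+     "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar",
+     "aileler de aktif olarak çalışıyorlar",
+ ]
+
+ # Tokenize each sentence and accumulate root/suffix counts
+ counts = Counter()
+ for sentence in sentences:
+     counts.update(tokenizer.tokenize(sentence))
+
+ # Most frequent roots and suffixes across the sample
+ print(counts.most_common(5))
+ ```
+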
+ ## Languages
+ This tokenizer focuses on the **Turkish** language and is designed to handle its rich morphological structure.
+
+ ## Tokenizer Details
+ TR Tokenizer is implemented as a FastTokenizer, which provides high-performance tokenization. It combines Turkish grammar rules with modern NLP methods to separate words into their roots and suffixes. The tokenizer is trained with predefined word and suffix lists and analyzes words while preserving the semantic integrity of Turkish.
+
+ ### Example Usage
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the fast tokenizer from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("alibayram/tr_tokenizer", use_fast=True)
+
+ sentence = "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar"
+
+ # Split the sentence into root and suffix tokens
+ tokens = tokenizer.tokenize(sentence)
+ print(tokens)
+ # Output: ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte', 'aktif', 'çalış', 'ı', 'yor', 'lar']
+
+ # Encode the text to token IDs
+ encoded = tokenizer(sentence)
+ print(encoded)
+ ```
+
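+ Building on the example above, the sketch below shows one way to batch-encode Turkish text during language model preprocessing. It is a hedged illustration, not part of this repository: `padding` and `add_special_tokens` are standard `transformers` options, and the `[PAD]` guard is an assumption in case the tokenizer does not define a pad token.
+
+ ```python
+ # Minimal sketch: batch-encoding for language model preprocessing.
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("alibayram/tr_tokenizer", use_fast=True)
+
+ # Assumption: register a pad token if none is defined, so padding works
+ if tokenizer.pad_token is None:
+     tokenizer.add_special_tokens({"pad_token": "[PAD]"})
+
+ batch = [
+     "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar",
+     "aileler aktif çalışıyor",
+ ]
+
+ # Pad the shorter sequence so all rows have equal length
+ encoded = tokenizer(batch, padding=True)
+ for ids in encoded["input_ids"]:
+     print(len(ids), ids)
+ ```
+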
+ ## Licensing Information
+ TR Tokenizer is provided under the CC BY-NC 4.0 license and may be used freely for non-commercial purposes such as research and education. Commercial use requires additional permission.
+
+ ## Citation Information
+ Researchers using this tokenizer are encouraged to cite it as follows:
+
+ ```bibtex
+ @software{bayram_2024_tr_tokenizer,
+   author    = {Bayram, M. Ali},
+   title     = {{TR Tokenizer: Turkish Word Segmentation Tool Based on Semantic Integrity}},
+   year      = 2024,
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/alibayram/tr_tokenizer}
+ }
+ ```
+
+ ## Contact and Contributions
+ For more information, to contribute, or to provide feedback, contact [Ali Bayram](https://github.com/malibayram). All feedback and contributions are valuable to the project's development.
+
+ Those who wish to contribute can visit the GitHub repository. Thank you in advance for your contributions to the project and to the open-source community!