	---
	language:
	- ja
	license: cc-by-sa-4.0
	datasets:
	- wikipedia
	- cc100
	widget:
	- text: "早稲田大学で自然言語処理を"
	---

	# nlp-waseda/gpt2-xl-japanese

	This model is a Japanese GPT-2 pretrained on Japanese Wikipedia and CC-100.
	The model architecture is based on [Radford+ 2019](https://paperswithcode.com/paper/language-models-are-unsupervised-multitask).

	## Intended uses & limitations

	You can use the raw model for text generation or fine-tune it on a downstream task.

	Note that input texts must be segmented into words using [Juman++](https://github.com/ku-nlp/jumanpp) in advance.
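
As a minimal sketch of that segmentation step, the helper below pipes text through the Juman++ command-line tool. It assumes a `jumanpp` binary is on `PATH` and that it accepts a `--segment` option that prints space-separated words (check `jumanpp --help` for your version). The function name `segment_with_jumanpp` is hypothetical and not part of this model's code.

```python
import shutil
import subprocess


def segment_with_jumanpp(text: str, command: str = "jumanpp") -> str:
    """Return `text` with words separated by spaces, using the Juman++ CLI.

    Assumption: the binary supports `--segment` and reads one sentence per line.
    """
    # Fail early with a clear error if the binary is missing.
    if shutil.which(command) is None:
        raise FileNotFoundError(f"{command!r} was not found on PATH")
    result = subprocess.run(
        [command, "--segment"],
        input=text + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Calling `segment_with_jumanpp("早稲田大学で自然言語処理を")` should return the same text with spaces between words; the exact word boundaries depend on the Juman++ version and dictionary.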

	### How to use

	You can use this model directly with a pipeline for text generation. Because generation involves random sampling, we set a seed for reproducibility:

	```python
	from transformers import pipeline, set_seed
	generator = pipeline('text-generation', model='nlp-waseda/gpt2-xl-japanese')
	# To run on a GPU, pass device=0 instead:
	# generator = pipeline('text-generation', model='nlp-waseda/gpt2-xl-japanese', device=0)

	set_seed(42)
	generator("早稲田大学で自然言語処理を", max_length=30, do_sample=True, pad_token_id=2, num_return_sequences=5)
	[{'generated_text': '早稲田大学で自然言語処理を勉強している大学生です. 自然言語処理や音声認識, 機械学習等に興味があり, 特に画像'},
	{'generated_text': '早稲田大学で自然言語処理を学んでいるとある方とお会いしてきました. 今日はお話する時間が少なかったのですが,'},
	{'generated_text': '早稲田大学で自然言語処理を研究しているが、それを趣味とは思わず、会社を作るための手段ととらえているようです。'},
	{'generated_text': '早稲田大学で自然言語処理を専門的に学ぶサークルです。日本語教育センターで日本語を勉強した中国の人たちと交流する'},
	{'generated_text': '早稲田大学で自然言語処理を専攻した時に、数学の知識・プログラミング言語の知識が身についていたのは、とても役'}]
	```
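
Each returned item repeats the prompt before the generated continuation. If only the new text is needed, a small helper like the one below can strip the prompt. This is an illustrative sketch; `continuations` is a hypothetical name. The `transformers` text-generation pipeline also accepts `return_full_text=False`, which should give a similar result without post-processing.

```python
from typing import Dict, List


def continuations(prompt: str, outputs: List[Dict[str, str]]) -> List[str]:
    """Strip the prompt prefix from each generated text, keeping only the new part."""
    results = []
    for item in outputs:
        text = item["generated_text"]
        # Leave the text untouched if it does not start with the prompt.
        results.append(text[len(prompt):] if text.startswith(prompt) else text)
    return results
```

Passing the prompt and the list returned by `generator(...)` yields one continuation string per sampled sequence.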

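	You can also use this model to get the features of a given text in PyTorch:

	```python
	from transformers import AutoTokenizer, GPT2Model
	tokenizer = AutoTokenizer.from_pretrained('nlp-waseda/gpt2-xl-japanese')
	model = GPT2Model.from_pretrained('nlp-waseda/gpt2-xl-japanese')
	text = "早稲田大学で自然言語処理を"
	# Tokenize the text and return PyTorch tensors.
	encoded_input = tokenizer(text, return_tensors='pt')
	# Run the base model to get the hidden states.
	output = model(**encoded_input)
	```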
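
One possible way to turn those features into a single vector is to take the hidden state of the last token, which in a left-to-right model has attended to the whole input. The sketch below is an assumption about how the features might be used, not a method described by the model authors; `last_token_features` is a hypothetical name. Imports are done inside the function because loading the model is expensive.

```python
def last_token_features(text: str, model_name: str = "nlp-waseda/gpt2-xl-japanese"):
    """Return the final-layer hidden state of the last token as a 1-D tensor."""
    import torch
    from transformers import AutoTokenizer, GPT2Model

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = GPT2Model.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features

    encoded_input = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients are needed for feature extraction
        output = model(**encoded_input)

    # last_hidden_state has shape (batch, sequence_length, hidden_size).
    return output.last_hidden_state[0, -1]
```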

	### Preprocessing

	The texts are normalized using [neologdn](https://github.com/ikegami-yukino/neologdn), segmented into words using [Juman++](https://github.com/ku-nlp/jumanpp), and tokenized by [BPE](https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.BPE). Juman++ 2.0.0-rc3 was used for pretraining.
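
The order of these steps can be sketched as a small function. This is an illustration under assumptions: `preprocess` is a hypothetical helper, the `segment` argument stands in for whatever Juman++ wrapper you use, and `neologdn.normalize` is used as the default normalizer. BPE tokenization is left to the tokenizer loaded with `AutoTokenizer.from_pretrained`. Whether the published tokenizer already applies any of these steps itself is not stated here.

```python
from typing import Callable, Optional


def preprocess(
    text: str,
    segment: Callable[[str], str],
    normalize: Optional[Callable[[str], str]] = None,
) -> str:
    """Normalize `text`, then segment it into space-separated words."""
    if normalize is None:
        # Imported lazily so the function can be tried without neologdn installed.
        import neologdn

        normalize = neologdn.normalize
    # Normalization runs first so that segmentation sees the cleaned text.
    return segment(normalize(text))
```

The result of `preprocess(...)` is the string to pass to the tokenizer or to the generation pipeline.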

	The model was trained on 8 NVIDIA A100 GPUs.


	## Acknowledgments

	This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".

	For training the models, we used [mdx](https://mdx.jp/), a platform for the data-driven future.