yuhkis committed
Commit c46f489 (verified)
1 Parent(s): 8f0d789

Update README.md

Files changed (1)
  1. README.md +69 -5
README.md CHANGED
@@ -46,17 +46,81 @@ model = AutoModelForCausalLM.from_pretrained(
 tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=HF_TOKEN)
 ```
 
-### Output Format
+### Output Generation and Format
 
-The model outputs results in JSONL format with required fields:
-- task_id: Task identifier
-- output: Generated response
-
-Example output:
+#### Implementation Details
+
+To generate output in the required JSONL format:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from peft import PeftModel
+import torch
+from tqdm import tqdm
+import json
+
+# Load the model with 4-bit NF4 quantization to reduce memory use
+model_id = "yuhkis/llm-jp-3-13b-finetune"
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=bnb_config,
+    device_map="auto",
+    token=HF_TOKEN
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=HF_TOKEN)
+
+# Generate outputs; `datasets` is assumed to be an iterable of records
+# with "task_id" and "input" keys, loaded beforehand
+results = []
+for data in tqdm(datasets):
+    input = data["input"]
+    # Build the prompt in the model's expected format
+    # ("### 指示" = instruction, "### 回答" = response)
+    prompt = f"""### 指示
+{input}
+### 回答
+"""
+
+    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
+    attention_mask = torch.ones_like(tokenized_input)
+
+    # Greedy decoding with a repetition penalty; no gradients needed at inference
+    with torch.no_grad():
+        outputs = model.generate(
+            tokenized_input,
+            attention_mask=attention_mask,
+            max_new_tokens=100,
+            do_sample=False,
+            repetition_penalty=1.2,
+            pad_token_id=tokenizer.eos_token_id
+        )[0]
+    # Decode only the newly generated tokens, skipping the echoed prompt
+    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)
+
+    results.append({"task_id": data["task_id"], "output": output})
+
+# Save results to a JSONL file, one JSON object per line
+with open("results.jsonl", 'w', encoding='utf-8') as f:
+    for result in results:
+        json.dump(result, f, ensure_ascii=False)
+        f.write('\n')
+```
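+
+The loop above assumes `datasets` is already in memory. As a minimal sketch (the file name `tasks.jsonl` is a placeholder, not a file shipped with this repository), the task data can be loaded from a JSONL file like this:
+
+```python
+import json
+
+# Hypothetical input file: one JSON object per line with "task_id" and "input" keys
+datasets = []
+with open("tasks.jsonl", "r", encoding="utf-8") as f:
+    for line in f:
+        if line.strip():  # skip blank lines
+            datasets.append(json.loads(line))
+```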
+
+#### Output Format Specification
+
+Required fields in the JSONL output:
+- task_id: Task identifier (integer)
+- output: Generated response (string)
+
+Example output format:
 ```json
 {"task_id": 0, "output": "応答テキスト"}
 ```
 
+Note: While additional fields (e.g., input, eval_aspect) may be included, only task_id and output are required for submission.
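+
+As a quick pre-submission check, here is a minimal sketch (assuming `results.jsonl` was written by the script above) that verifies each line carries the two required fields with the expected types:
+
+```python
+import json
+
+# Check that every line of the output file has the two required fields with correct types
+with open("results.jsonl", encoding="utf-8") as f:
+    for i, line in enumerate(f):
+        record = json.loads(line)
+        assert isinstance(record.get("task_id"), int), f"line {i}: task_id missing or not an integer"
+        assert isinstance(record.get("output"), str), f"line {i}: output missing or not a string"
+print("results.jsonl passed the format check")
+```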
+
 ### Out-of-Scope Use
 
 This model should not be used for: