fullstack committed
Commit 9914ec6
1 Parent(s): 5e62489

Upload folder using huggingface_hub

Files changed (1)
  1. generate_training_data.py +319 -0
generate_training_data.py ADDED
@@ -0,0 +1,319 @@
+ import httpx
+ import json
+ import random
+ from typing import List, Dict
+ import threading
+ import queue
+ import time
+ import logging
+ import re
+ import os
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+
+ # Base URL for the API
+ BASE_URL = "http://localhost:6002/v1/chat/completions"
+
+ # API key (replace with your actual key)
+ API_KEY = "your-api-key-here"
+
+ # Prompt template (MESSAGES) and a sample output record (example_jsonl)
+ MESSAGES = """
+ ### Task: Generate a high-quality, detailed, expert-level multiple-choice question and answer in the style of a Graduate-Level Google-Proof Q&A Benchmark.
+ Then list the correct answer and 3 incorrect answers. Provide a detailed explanation for the correct answer only.
+
+ Here's what a sophisticated sample output would look like:
+
+ -------
+ ### Category: cosmology
+ In cosmological studies, various observational techniques are employed to constrain the equation of state of dark energy. Consider the following methods:
+
+ Intensity mapping of neutral hydrogen emission lines at frequencies between 600 MHz and 1.4 GHz
+ Intensity mapping of CO emission lines at frequencies between 10 GHz to 20 GHz
+ Galaxy-redshift surveys at redshift < 2
+ Measurements of frequency shifts in absorption lines of cold damped Lyman-alpha systems at redshift < 2
+
+ Which of these methods is least effective in directly constraining the dark energy equation of state, and why?
+
+
+ A) Intensity mapping of neutral hydrogen emission lines at frequencies between 600 MHz and 1.4 GHz (incorrect)
+ B) Intensity mapping of CO emission lines at frequencies between 10 GHz to 20 GHz (incorrect)
+ C) Galaxy-redshift surveys at redshift < 2 (incorrect)
+ D) Measurements of frequency shifts in absorption lines of cold damped Lyman-alpha systems at redshift < 2 (correct)
+
+
+ Answer: D) Measurements of frequency shifts in absorption lines of cold damped Lyman-alpha systems at redshift < 2
+
+ The method least effective in directly constraining the dark energy equation of state is:
+ Measurements of frequency shifts in absorption lines of cold damped Lyman-alpha systems at redshift < 2
+
+ Explanation: While this method can provide valuable information about the expansion of the universe, it is less directly applicable to constraining the dark energy equation of state compared to the other options. The primary reasons are:
+
+ Limited redshift range: The specified redshift range (z < 2) is relatively low, whereas dark energy effects become more prominent at higher redshifts.
+ Indirect measurement: This method primarily measures the expansion rate of the universe rather than directly probing dark energy properties.
+ Systematics and uncertainties: Lyman-alpha systems can be affected by various astrophysical processes, introducing complexities in interpreting the data specifically for dark energy constraints.
+ Lack of large-scale structure information: Unlike galaxy surveys or intensity mapping techniques, this method doesn't provide information about the growth of large-scale structure, which is crucial for distinguishing between different dark energy models.
+
+ The other methods listed (intensity mapping and galaxy-redshift surveys) are generally more effective for constraining the dark energy equation of state because they provide complementary information about both the expansion history and the growth of structure in the universe, allowing for better constraints on dark energy properties. Also make a correct and 3 incorrect answers for Q and A.
+ -------
+
+ Return a new sophisticated example based on the reference format. Make sure yours is a unique, expert-level question with a lengthy, detailed explanation:
+
+ Do not talk about the answers in the question in the explanation. Only the reasoning behind the correct answer should be explained.
+
+ ### Category: {category}"""
+
+
+ example_jsonl = """{"category": "Biology", "question": "Given the observed correlation between the mutation in protein X and accelerated muscle growth, what is the most plausible mechanism by which this mutation could lead to the observed phenotype?", "correct_answer": "Increased protein X activity", "incorrect_answer1": "Decreased protein X activity", "incorrect_answer2": "Direct interaction with muscle fibers", "incorrect_answer3": "Alteration of gene expression in unrelated pathways", "explanation": "The most plausible mechanism is: \n\nIncreased protein X activity: \n\nExplanation:\nThe mutation likely leads to enhanced protein X function. \n* Increased protein synthesis: The mutation could increase the rate of protein X translation, resulting in a higher number of functional protein X molecules.\n* Enhanced protein stability: The mutation might make the protein X molecule more resistant to degradation, leading to a longer lifespan and increased overall activity.\n* Altered protein binding: The mutation could modify protein X's ability to bind to other molecules, potentially activating downstream signaling pathways involved in muscle growth.\n\nIncorrect Answers:\n1. Decreased protein X activity: This is unlikely, as the mutation leads to increased expression levels, suggesting a gain-of-function effect.\n2. Direct interaction with muscle fibers: Protein X might not directly interact with muscle cells, so this mechanism is less probable.\n3. Alteration of gene expression in unrelated pathways: While the mutation could have broader effects, the strongest link is to the increased protein X levels and its potential role in muscle growth."}"""
+
+ # List of categories (this short list is overridden by the full list that follows)
+ categories = [
+     "Law",
+     # "Physics", "Mathematics", "Computer Science", "Biology",
+     # "Chemistry", "History", "Literature", "Economics", "Psychology"
+ ]
+
+ categories = [
+     "Metalworking (smelting, casting, forging)", "Machine Tools (lathes, mills, drills)",
+     "Woodworking (saws, planers, jointers)", "Textiles (spinning, weaving, knitting)",
+     "Ceramics and Glassmaking", "Paper Production",
+     "Chemical Manufacturing (acids, bases, solvents)", "Plastics and Polymers",
+     "Electrical Systems (generators, motors, transformers)",
+     "Electronics (basic components, circuit boards)",
+     "Fuel Production (biofuels, alcohol distillation)", "Refrigeration and Air Conditioning",
+     "Food Processing and Preservation", "Agricultural Machinery",
+     "Transportation (bicycles, carts, simple engines)", "Mining and Mineral Extraction",
+     "Water Treatment and Distribution", "Waste Management and Recycling",
+     "Construction Materials (cement, concrete, bricks)",
+     "Energy Production (solar panels, wind turbines, hydropower)",
+     "Toolmaking (hand tools, precision instruments)", "Rope and Cordage Manufacturing",
+     "Adhesives and Sealants", "Paint and Coatings", "Lubricants and Greases",
+     "Batteries and Energy Storage", "Welding and Joining Techniques",
+     "Pneumatics and Hydraulics", "Pumps and Compressors", "Gears and Power Transmission",
+     "Bearings and Bushings", "Springs and Shock Absorbers",
+     "Fasteners (screws, bolts, rivets)", "Seals and Gaskets",
+     "Filters and Separation Technology", "Heating Systems (furnaces, boilers)",
+     "Plumbing and Pipe Fitting", "Insulation Materials", "Lighting Technology",
+     "Timekeeping Devices", "Measurement and Surveying Tools",
+     "Communication Systems (radio, telegraph)", "Printing and Information Dissemination",
+     "Optics and Lens Crafting", "Medical Equipment Manufacturing",
+     "Pharmaceutical Production", "Fertilizer and Soil Amendments", "Pest Control Methods",
+     "Animal Husbandry Equipment", "Tanning and Leather Working",
+     "Brewing and Fermentation", "Distillation Equipment",
+     "Fiber Processing (cotton gins, wool carders)", "Metallurgy and Alloying",
+     "Foundry Techniques", "Sheet Metal Working", "Blacksmithing and Ironworking",
+     "Precision Grinding", "Heat Treatment of Metals", "Woodturning and Carving",
+     "Joinery and Cabinetmaking", "Masonry and Stoneworking",
+     "Roofing and Weatherproofing", "Paint and Pigment Production",
+     "Dyeing and Coloring Techniques", "Pottery and Ceramics Forming",
+     "Kiln Design and Operation", "Glass Blowing and Forming",
+     "Fiber Reinforced Composites", "Rubber Processing and Vulcanization",
+     "Oil Pressing and Refining", "Soap and Detergent Manufacturing", "Candle Making",
+     "Rope Making and Cordage", "Basketry and Weaving", "Papermaking and Pulp Processing",
+     "Bookbinding and Preservation", "Ink and Writing Implement Production",
+     "Clockmaking and Timekeeping", "Lock and Key Manufacturing",
+     "Simple Computing Devices", "Analog Instrumentation", "Geolocation and Mapping Tools",
+     "Water Wheel and Turbine Design", "Wind Power Mechanisms",
+     "Solar Thermal Technologies", "Biogas Production", "Charcoal Making",
+     "Beekeeping Equipment", "Mushroom Cultivation", "Hydroponics and Aeroponics",
+     "Greenhouse Design and Operation", "Seed Cleaning and Storage",
+     "Food Dehydration Equipment", "Canning and Bottling Technology",
+     "Refrigeration Without Electricity", "Evaporative Cooling Systems",
+     "Water Filtration and Purification", "Waste Composting Systems",
+     "Anaerobic Digestion", "Bioremediation Techniques",
+ ]
+
+ # Thread-safe queues and locks
+ result_queue = queue.Queue()
+ category_queue = queue.Queue()
+ file_lock = threading.Lock()
+
+ def create_chat_messages(category: str) -> List[Dict[str, str]]:
+     """Build the chat messages that ask for one question in `category`."""
+     return [
+         # {"role": "system", "content": "You are an expert in creating graduate-level, Google-proof multiple-choice questions for various academic fields."},
+         {"role": "user", "content": f"""Generate a high-quality, graduate-level, Google-proof multiple-choice question for the category: {category}.
+
+ Include:
+ 1. The question
+ 2. Four answer choices (A, B, C, D) with one correct answer
+ 3. Indication of which answer is correct
+ 4. A detailed explanation for why the correct answer is right and the others are wrong
+
+ Make sure the question is sophisticated and challenging, suitable for graduate-level students in the field.
+
+ Here's an example of a well-structured question:
+ {MESSAGES.format(category=category)}"""}
+     ]
+
+ def generate_question(category: str) -> str:
+     """Ask the model for one multiple-choice question in the given category."""
+     messages = create_chat_messages(category)
+
+     payload = {
+         "model": "gemma",  # or whatever model name your local server uses
+         "messages": messages,
+         "max_tokens": 2000,
+         "temperature": 0.7,
+         "top_p": 0.841,
+         "frequency_penalty": 0,
+         "presence_penalty": 0,
+         "n": 1,
+         "stream": False,
+     }
+
+     logging.info(f"Generating question for category: {category}")
+     with httpx.Client() as client:
+         # Send the configured API key, matching the header used in convert_to_jsonl
+         response = client.post(
+             BASE_URL,
+             json=payload,
+             headers={"Authorization": f"Bearer {API_KEY}"},
+             timeout=60,
+         )
+         response.raise_for_status()
+         result = response.json()
+
+     res = result['choices'][0]['message']['content']
+     print(f"Generated question for {category}:\n\n{res}")
+     return res
+
+ def convert_to_jsonl(category: str, text: str) -> dict:
+     """Ask the model to restructure free-form question text into a JSON object."""
+     messages = [
+         {"role": "user", "content": f"""You are an expert in parsing and structuring data into JSON format. Convert the following text into a JSON object with these keys:
+ "category", "question", "correct_answer", "incorrect_answer1", "incorrect_answer2", "incorrect_answer3", "explanation"
+
+ example json:
+ ---
+ {example_jsonl}
+ ---
+
+ Here's the text to convert:
+
+ {text}
+
+ Return only the JSON object, nothing else."""},
+         {"role": "assistant", "content": '```jsonl\n'},
+     ]
+
+     json_schema = {
+         "type": "object",
+         "properties": {
+             "category": {"type": "string"},
+             "question": {"type": "string"},
+             "correct_answer": {"type": "string"},
+             "incorrect_answer1": {"type": "string"},
+             "incorrect_answer2": {"type": "string"},
+             "incorrect_answer3": {"type": "string"},
+             "explanation": {"type": "string"}
+         },
+         "required": ["category", "question", "correct_answer", "incorrect_answer1", "incorrect_answer2", "incorrect_answer3", "explanation"]
+     }
+
+     data = {
+         "model": "gemma",
+         "messages": messages,
+         "guided_json": json_schema,
+         "guided_decoding_backend": "lm-format-enforcer",
+         "temperature": 0.1,
+         "stop": ["Let me know"],
+         "max_tokens": 1000
+     }
+
+     try:
+         # Use a context manager so the client's connections are always closed
+         with httpx.Client(timeout=240) as client:
+             response = client.post(BASE_URL, json=data, headers={
+                 "Authorization": f"Bearer {API_KEY}",
+                 "Content-Type": "application/json"
+             })
+             response.raise_for_status()
+             result = response.json()
+         parsed_json = json.loads(result['choices'][0]['message']['content'])
+         print(f"JSON parsed for {category}:\n\n{json.dumps(parsed_json, indent=4)}")
+         return parsed_json
+     except (httpx.HTTPError, json.JSONDecodeError, KeyError) as e:
+         raise ValueError(f"Failed to parse JSON: {str(e)}") from e
+
+ def process_category(category: str):
+     """Generate one question for `category`, save it, and queue it for the combined file."""
+     try:
+         logging.info(f"Processing category: {category}")
+         question_text = generate_question(category)
+         parsed_json = convert_to_jsonl(category, question_text)
+         save_to_jsonl(json.dumps(parsed_json), f"{category.lower().replace(' ', '_')}.jsonl")
+
+         result_queue.put((category, json.dumps(parsed_json)))
+     except Exception as e:
+         logging.error(f"An error occurred while processing {category}: {str(e)}")
+
+ def save_to_jsonl(jsonl_text: str, filename: str):
+     logging.info(f"Attempting to save to file: {filename}")
+     logging.info(f"Content to be saved: {jsonl_text}")
+
+     # Serialize file access, since worker threads may call this concurrently
+     with file_lock:
+         try:
+             with open(filename, 'a', encoding='utf-8') as f:
+                 f.write(jsonl_text + '\n')
+             logging.info(f"Successfully appended to {filename}")
+         except Exception as e:
+             logging.error(f"Error while writing to {filename}: {str(e)}")
+
+         # Verify file content after writing
+         try:
+             with open(filename, 'r', encoding='utf-8') as f:
+                 updated_content = f.read()
+             logging.info(f"Updated content of {filename}:\n{updated_content}")
+         except Exception as e:
+             logging.error(f"Error while reading {filename} after write: {str(e)}")
+
+
+ def clean_and_parse_json(text: str) -> Dict:
+     import ast
+
+     # First, remove the ```jsonl wrapper if present
+     text = re.sub(r'```jsonl\s*|\s*```', '', text)
+
+     # Take the outermost {...} span
+     json_pattern = r'(?s)\{.*\}'
+     match = re.search(json_pattern, text)
+     if not match:
+         raise ValueError("No JSON-like structure found in the text")
+
+     json_str = match.group().strip()
+
+     # Try strict JSON first; rewriting quotes up front would corrupt apostrophes in the text
+     try:
+         return json.loads(json_str)
+     except json.JSONDecodeError:
+         pass
+
+     # Fall back to Python-literal syntax (e.g. single-quoted keys and strings)
+     try:
+         parsed_json = ast.literal_eval(json_str)
+     except (ValueError, SyntaxError) as e:
+         raise ValueError(f"Failed to parse JSON: {str(e)}") from e
+     if not isinstance(parsed_json, dict):
+         raise ValueError("Parsed value is not a JSON object")
+     return parsed_json
+
+
+ # def process_category(category: str):
+ #     try:
+ #         logging.info(f"Processing category: {category}")
+ #         question_text = generate_question(category)
+ #         jsonl_text = convert_to_jsonl(category, question_text)
+
+ #         try:
+ #             parsed_json = clean_and_parse_json(jsonl_text)
+ #             result_queue.put((category, json.dumps(parsed_json)))
+ #         except ValueError as e:
+ #             logging.error(f"Failed to parse JSON for {category}: {str(e)}")
+ #             save_to_jsonl(jsonl_text, f"{category.lower().replace(' ', '_')}_failed.jsonl")
+ #     except Exception as e:
+ #         logging.error(f"An error occurred while processing {category}: {str(e)}")
+
+ def worker():
+     while True:
+         try:
+             category = category_queue.get(timeout=1)  # Wait for 1 second
+         except queue.Empty:
+             continue  # Nothing was fetched, so task_done() must not be called
+
+         try:
+             if category is None:
+                 break  # Sentinel: exit the worker
+             process_category(category)
+         except Exception as e:
+             logging.error(f"Worker thread encountered an error: {str(e)}")
+         finally:
+             # Exactly one task_done() per successful get(), including the sentinel
+             category_queue.task_done()
+
+ def main():
+     num_threads = 6  # Reduced number of threads to avoid overwhelming the API
+     output_file = "all_questions.jsonl"
+
+     threads = []
+     for _ in range(num_threads):
+         t = threading.Thread(target=worker)
+         t.daemon = True  # Set threads as daemon
+         t.start()
+         threads.append(t)
+
+     for category in categories:
+         category_queue.put(category)
+
+     # Wait for all tasks to be completed
+     category_queue.join()
+
+     # Signal threads to exit
+     for _ in range(num_threads):
+         category_queue.put(None)
+
+     # Wait for all threads to finish
+     for t in threads:
+         t.join()
+
+     # Write every collected result to the combined output file
+     while not result_queue.empty():
+         category, jsonl_text = result_queue.get()
+         save_to_jsonl(jsonl_text, output_file)
+         logging.info(f"Question for {category} saved to {output_file}")
+
+ if __name__ == "__main__":
+     try:
+         main()
+     except KeyboardInterrupt:
+         logging.info("Script interrupted by user. Shutting down gracefully...")
+     except Exception as e:
+         logging.error(f"An unexpected error occurred: {str(e)}")
+     finally:
+         logging.info("Script execution completed.")
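
The worker/queue pattern used by `worker()` and `main()` can be tried in isolation. The following is a minimal sketch, not part of the commit: `run_workers` and `handler` are hypothetical names, and a plain function stands in for the HTTP calls. It shows the bookkeeping that `Queue.join()` relies on, namely one `task_done()` for every item actually fetched with `get()`, sentinels included.

```python
import queue
import threading
from typing import Callable, List


def run_workers(items: List[str], handler: Callable[[str], str], num_threads: int = 3) -> List[str]:
    """Process items on worker threads and return the results (order not guaranteed)."""
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = tasks.get()  # blocks until an item or a sentinel arrives
            try:
                if item is None:  # sentinel: stop this worker
                    return
                results.put(handler(item))
            except Exception as exc:  # a failing item must not kill the worker
                print(f"failed on {item!r}: {exc}")
            finally:
                tasks.task_done()  # one task_done() per successful get()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    tasks.join()  # returns once every real item has been handled
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    print(sorted(run_workers(["Law", "Optics"], str.upper)))
```

Using a blocking `get()` with sentinels avoids the timeout path entirely. If a timeout is kept, as in the script, the `queue.Empty` branch has to skip `task_done()`, otherwise the queue's unfinished-task counter is decremented for work that was never fetched.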