avantol committed
Commit 025ac5e · 1 Parent(s): 3176ee9

feat(pfb): fix paths in notebook
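
For context, here is a minimal sketch of why the clone target and the import line change in this diff. It is not part of the commit itself, and it rests on two assumptions about the old failure mode: a `!cd` line in a notebook runs in a throwaway subshell, so the working directory never actually changes, and a hyphenated directory name cannot be imported as a Python package. The package name `demo_pkg`, the stubbed `read_schema` body, and the `sdm.json` argument are hypothetical stand-ins, not the real `utils` module.

```python
import importlib
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# 1) A shell `cd` runs in a child process, so the caller's working directory is unchanged.
before = os.getcwd()
subprocess.run("cd ..", shell=True, check=True)
assert os.getcwd() == before

# 2) A hyphenated directory name is not a valid identifier, so `import` cannot name it.
assert not "llama-data-model-generator-demo".isidentifier()
assert "llama_data_model_generator_demo".isidentifier()

# 3) A directory with an identifier-safe name and an __init__.py imports as a package.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"            # hypothetical stand-in for the cloned repo
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")    # empty marker file, like the one added here
    (pkg / "utils.py").write_text(
        "def read_schema(schema):\n    return {'path': schema}\n"  # stub, not the real helper
    )

    sys.path.insert(0, tmp)                 # a notebook's cwd is typically on sys.path already
    try:
        utils = importlib.import_module("demo_pkg.utils")
        result = utils.read_schema(schema="sdm.json")
    finally:
        sys.path.remove(tmp)

print(result)
```

Cloning into `llama_data_model_generator_demo` and importing `llama_data_model_generator_demo.utils` follows the same pattern as step 3, which is presumably why the `!cd` line could be dropped.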

__init__.py ADDED
File without changes
serialized_file_creation_demo/serialized_file_creation_demo.ipynb CHANGED
@@ -3,20 +3,23 @@
3
  {
4
  "cell_type": "markdown",
5
  "id": "0",
6
- "metadata": {},
 
 
7
  "source": [
8
  "# Creation of Serialized File From AI Model Output\n",
9
  "---\n",
10
- "This notebook demonstrates how to use the AI-assited data model output (originally just a collection of TSV files) to a serialized file, a [PFB (Portable Format for Bioinformatics)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10035862/) file. \n",
11
  "\n",
12
- "PFB is widely used within NIH-funded initiativies that our center is a part of, as a means for efficient storage and transfer of data between systems.\n",
13
- " "
14
  ]
15
  },
16
  {
17
  "cell_type": "markdown",
18
  "id": "1",
19
- "metadata": {},
 
 
20
  "source": [
21
  "### Setup"
22
  ]
@@ -25,7 +28,13 @@
25
  "cell_type": "code",
26
  "execution_count": null,
27
  "id": "2",
28
- "metadata": {},
 
 
 
 
 
 
29
  "outputs": [],
30
  "source": [
31
  "%pip install pandas gen3"
@@ -34,7 +43,9 @@
34
  {
35
  "cell_type": "markdown",
36
  "id": "3",
37
- "metadata": {},
 
 
38
  "source": [
39
  "We need some helper files to demonstrate this, so pull them in from Huggingface."
40
  ]
@@ -43,17 +54,24 @@
43
  "cell_type": "code",
44
  "execution_count": null,
45
  "id": "4",
46
- "metadata": {},
 
 
 
 
 
 
47
  "outputs": [],
48
  "source": [
49
- "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo\n",
50
- "!cd llama-data-model-generator-demo/serialized_file_creation_demo"
51
  ]
52
  },
53
  {
54
  "cell_type": "markdown",
55
  "id": "5",
56
- "metadata": {},
 
 
57
  "source": [
58
  "### Imports and Initial Loading"
59
  ]
@@ -62,10 +80,12 @@
62
  "cell_type": "code",
63
  "execution_count": null,
64
  "id": "6",
65
- "metadata": {},
 
 
66
  "outputs": [],
67
  "source": [
68
- "from utils import *\n",
69
  "import os\n",
70
  "from pathlib import Path\n",
71
  "import pandas as pd"
@@ -75,7 +95,9 @@
75
  "cell_type": "code",
76
  "execution_count": null,
77
  "id": "7",
78
- "metadata": {},
 
 
79
  "outputs": [],
80
  "source": [
81
  "# read in the minimal Gen3 data model scaffold\n",
@@ -86,7 +108,9 @@
86
  {
87
  "cell_type": "markdown",
88
  "id": "8",
89
- "metadata": {},
 
 
90
  "source": [
91
  "We are demonstrating the ability to use this against an AI-generated model, but not directly inferencing to get the data model. Instead we're using a Synthetic Data Contribution (a sample of what a data contributor would provide AND the expected simplified data model). We use these to train and test the AI model. For simplicity, we're using the model here."
92
  ]
@@ -95,7 +119,9 @@
95
  "cell_type": "code",
96
  "execution_count": null,
97
  "id": "9",
98
- "metadata": {},
 
 
99
  "outputs": [],
100
  "source": [
101
  "# Find the simplified data model in a Synthetic Data Contribution directory\n",
@@ -111,7 +137,9 @@
111
  "cell_type": "code",
112
  "execution_count": null,
113
  "id": "10",
114
- "metadata": {},
 
 
115
  "outputs": [],
116
  "source": [
117
  "sdm = read_schema(schema=sdm_path)"
@@ -120,7 +148,9 @@
120
  {
121
  "cell_type": "markdown",
122
  "id": "11",
123
- "metadata": {},
 
 
124
  "source": [
125
  "### Creation of Serialized File"
126
  ]
@@ -128,7 +158,9 @@
128
  {
129
  "cell_type": "markdown",
130
  "id": "12",
131
- "metadata": {},
 
 
132
  "source": [
133
  "As of writing, PFB requires a Gen3-style data model, so the next steps are to ensure we can go from the simplified AI model output to a Gen3 data model. Note that in the future we may allow alternative, non-Gen3 models to create such PFBs."
134
  ]
@@ -137,7 +169,9 @@
137
  "cell_type": "code",
138
  "execution_count": null,
139
  "id": "13",
140
- "metadata": {},
 
 
141
  "outputs": [],
142
  "source": [
143
  "## Create a Gen3 data model from the simplified data model\n",
@@ -157,7 +191,9 @@
157
  "cell_type": "code",
158
  "execution_count": null,
159
  "id": "14",
160
- "metadata": {},
 
 
161
  "outputs": [],
162
  "source": [
163
  "## Write the Gen3-style data model to a JSON file\n",
@@ -171,7 +207,9 @@
171
  {
172
  "cell_type": "markdown",
173
  "id": "15",
174
- "metadata": {},
 
 
175
  "source": [
176
  "Now we have the data model in proper format, we can serialize it into a PFB."
177
  ]
@@ -180,7 +218,9 @@
180
  "cell_type": "code",
181
  "execution_count": null,
182
  "id": "16",
183
- "metadata": {},
 
 
184
  "outputs": [],
185
  "source": [
186
  "# Convert the Gen3-style data model to PFB format schema\n",
@@ -191,7 +231,9 @@
191
  {
192
  "cell_type": "markdown",
193
  "id": "17",
194
- "metadata": {},
 
 
195
  "source": [
196
  "### PFB Utilities"
197
  ]
@@ -199,7 +241,9 @@
199
  {
200
  "cell_type": "markdown",
201
  "id": "18",
202
- "metadata": {},
 
 
203
  "source": [
204
  "Now we can demonstrate creation of a PFB when you have content for it (in this case in the form of TSV metadata). The above is a PFB which contains only the data model."
205
  ]
@@ -208,7 +252,9 @@
208
  "cell_type": "code",
209
  "execution_count": null,
210
  "id": "19",
211
- "metadata": {},
 
 
212
  "outputs": [],
213
  "source": [
214
  "# Get a list of TSV files in the sdm_dir\n",
@@ -220,7 +266,9 @@
220
  "cell_type": "code",
221
  "execution_count": null,
222
  "id": "20",
223
- "metadata": {},
 
 
224
  "outputs": [],
225
  "source": [
226
  "# calculate tsv file size and md5sum for each tsv_files\n",
@@ -254,7 +302,9 @@
254
  "cell_type": "code",
255
  "execution_count": null,
256
  "id": "21",
257
- "metadata": {},
 
 
258
  "outputs": [],
259
  "source": [
260
  "%ls -l $sdm_dir/tsv_metadata"
@@ -264,7 +314,9 @@
264
  "cell_type": "code",
265
  "execution_count": null,
266
  "id": "22",
267
- "metadata": {},
 
 
268
  "outputs": [],
269
  "source": [
270
  "tsv_metadata"
@@ -274,7 +326,9 @@
274
  "cell_type": "code",
275
  "execution_count": null,
276
  "id": "23",
277
- "metadata": {},
 
 
278
  "outputs": [],
279
  "source": [
280
  "pfb_data = os.path.join(sdm_dir, Path(out_file).stem + \"_data.avro\")\n",
@@ -286,7 +340,9 @@
286
  {
287
  "cell_type": "markdown",
288
  "id": "24",
289
- "metadata": {},
 
 
290
  "source": [
291
  "PFB contains a utility to convert from the serialized format to more readable and workable files, including TSVs. Here we demonstrate that utility:"
292
  ]
@@ -295,7 +351,9 @@
295
  "cell_type": "code",
296
  "execution_count": null,
297
  "id": "25",
298
- "metadata": {},
 
 
299
  "outputs": [],
300
  "source": [
301
  "!gen3 pfb to -i $pfb_data tsv # convert the PFB file to TSV format"
@@ -305,7 +363,9 @@
305
  "cell_type": "code",
306
  "execution_count": null,
307
  "id": "26",
308
- "metadata": {},
 
 
309
  "outputs": [],
310
  "source": [
311
  "!gen3 pfb show -i $pfb_data # show the contents of the PFB file"
@@ -315,7 +375,9 @@
315
  "cell_type": "code",
316
  "execution_count": null,
317
  "id": "27",
318
- "metadata": {},
 
 
319
  "outputs": [],
320
  "source": [
321
  "!gen3 pfb show -i $pfb_data schema | jq # show the schema of the PFB file"
@@ -324,7 +386,9 @@
324
  {
325
  "cell_type": "markdown",
326
  "id": "28",
327
- "metadata": {},
 
 
328
  "source": [
329
  "Now we've gone all the way from a dump of data contribution files, to a simple structured data model, to a serialized PFB, and back to usable files!"
330
  ]
@@ -332,11 +396,17 @@
332
  {
333
  "cell_type": "markdown",
334
  "id": "29",
335
- "metadata": {},
 
 
336
  "source": []
337
  }
338
  ],
339
  "metadata": {
 
 
 
 
340
  "kernelspec": {
341
  "display_name": "Python 3",
342
  "language": "python",
 
3
  {
4
  "cell_type": "markdown",
5
  "id": "0",
6
+ "metadata": {
7
+ "id": "0"
8
+ },
9
  "source": [
10
  "# Creation of Serialized File From AI Model Output\n",
11
  "---\n",
12
+ "This notebook demonstrates how to use the AI-assisted data model output (originally just a collection of TSV files) to a serialized file, a [PFB (Portable Format for Bioinformatics)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10035862/) file.\n",
13
  "\n",
14
+ "PFB is widely used within NIH-funded initiatives that our center is a part of, as a means for efficient storage and transfer of data between systems."
 
15
  ]
16
  },
17
  {
18
  "cell_type": "markdown",
19
  "id": "1",
20
+ "metadata": {
21
+ "id": "1"
22
+ },
23
  "source": [
24
  "### Setup"
25
  ]
 
28
  "cell_type": "code",
29
  "execution_count": null,
30
  "id": "2",
31
+ "metadata": {
32
+ "colab": {
33
+ "base_uri": "https://localhost:8080/"
34
+ },
35
+ "id": "2",
36
+ "outputId": "93bf3200-e3e2-4607-b7fc-23de90f967e1"
37
+ },
38
  "outputs": [],
39
  "source": [
40
  "%pip install pandas gen3"
 
43
  {
44
  "cell_type": "markdown",
45
  "id": "3",
46
+ "metadata": {
47
+ "id": "3"
48
+ },
49
  "source": [
50
  "We need some helper files to demonstrate this, so pull them in from Huggingface."
51
  ]
 
54
  "cell_type": "code",
55
  "execution_count": null,
56
  "id": "4",
57
+ "metadata": {
58
+ "colab": {
59
+ "base_uri": "https://localhost:8080/"
60
+ },
61
+ "id": "4",
62
+ "outputId": "ca90e09b-4d66-4019-ea91-4f9694b246ec"
63
+ },
64
  "outputs": [],
65
  "source": [
66
+ "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo llama_data_model_generator_demo\n"
 
67
  ]
68
  },
69
  {
70
  "cell_type": "markdown",
71
  "id": "5",
72
+ "metadata": {
73
+ "id": "5"
74
+ },
75
  "source": [
76
  "### Imports and Initial Loading"
77
  ]
 
80
  "cell_type": "code",
81
  "execution_count": null,
82
  "id": "6",
83
+ "metadata": {
84
+ "id": "6"
85
+ },
86
  "outputs": [],
87
  "source": [
88
+ "from llama_data_model_generator_demo.utils import *\n",
89
  "import os\n",
90
  "from pathlib import Path\n",
91
  "import pandas as pd"
 
95
  "cell_type": "code",
96
  "execution_count": null,
97
  "id": "7",
98
+ "metadata": {
99
+ "id": "7"
100
+ },
101
  "outputs": [],
102
  "source": [
103
  "# read in the minimal Gen3 data model scaffold\n",
 
108
  {
109
  "cell_type": "markdown",
110
  "id": "8",
111
+ "metadata": {
112
+ "id": "8"
113
+ },
114
  "source": [
115
  "We are demonstrating the ability to use this against an AI-generated model, but not directly inferencing to get the data model. Instead we're using a Synthetic Data Contribution (a sample of what a data contributor would provide AND the expected simplified data model). We use these to train and test the AI model. For simplicity, we're using the model here."
116
  ]
 
119
  "cell_type": "code",
120
  "execution_count": null,
121
  "id": "9",
122
+ "metadata": {
123
+ "id": "9"
124
+ },
125
  "outputs": [],
126
  "source": [
127
  "# Find the simplified data model in a Synthetic Data Contribution directory\n",
 
137
  "cell_type": "code",
138
  "execution_count": null,
139
  "id": "10",
140
+ "metadata": {
141
+ "id": "10"
142
+ },
143
  "outputs": [],
144
  "source": [
145
  "sdm = read_schema(schema=sdm_path)"
 
148
  {
149
  "cell_type": "markdown",
150
  "id": "11",
151
+ "metadata": {
152
+ "id": "11"
153
+ },
154
  "source": [
155
  "### Creation of Serialized File"
156
  ]
 
158
  {
159
  "cell_type": "markdown",
160
  "id": "12",
161
+ "metadata": {
162
+ "id": "12"
163
+ },
164
  "source": [
165
  "As of writing, PFB requires a Gen3-style data model, so the next steps are to ensure we can go from the simplified AI model output to a Gen3 data model. Note that in the future we may allow alternative, non-Gen3 models to create such PFBs."
166
  ]
 
169
  "cell_type": "code",
170
  "execution_count": null,
171
  "id": "13",
172
+ "metadata": {
173
+ "id": "13"
174
+ },
175
  "outputs": [],
176
  "source": [
177
  "## Create a Gen3 data model from the simplified data model\n",
 
191
  "cell_type": "code",
192
  "execution_count": null,
193
  "id": "14",
194
+ "metadata": {
195
+ "id": "14"
196
+ },
197
  "outputs": [],
198
  "source": [
199
  "## Write the Gen3-style data model to a JSON file\n",
 
207
  {
208
  "cell_type": "markdown",
209
  "id": "15",
210
+ "metadata": {
211
+ "id": "15"
212
+ },
213
  "source": [
214
  "Now we have the data model in proper format, we can serialize it into a PFB."
215
  ]
 
218
  "cell_type": "code",
219
  "execution_count": null,
220
  "id": "16",
221
+ "metadata": {
222
+ "id": "16"
223
+ },
224
  "outputs": [],
225
  "source": [
226
  "# Convert the Gen3-style data model to PFB format schema\n",
 
231
  {
232
  "cell_type": "markdown",
233
  "id": "17",
234
+ "metadata": {
235
+ "id": "17"
236
+ },
237
  "source": [
238
  "### PFB Utilities"
239
  ]
 
241
  {
242
  "cell_type": "markdown",
243
  "id": "18",
244
+ "metadata": {
245
+ "id": "18"
246
+ },
247
  "source": [
248
  "Now we can demonstrate creation of a PFB when you have content for it (in this case in the form of TSV metadata). The above is a PFB which contains only the data model."
249
  ]
 
252
  "cell_type": "code",
253
  "execution_count": null,
254
  "id": "19",
255
+ "metadata": {
256
+ "id": "19"
257
+ },
258
  "outputs": [],
259
  "source": [
260
  "# Get a list of TSV files in the sdm_dir\n",
 
266
  "cell_type": "code",
267
  "execution_count": null,
268
  "id": "20",
269
+ "metadata": {
270
+ "id": "20"
271
+ },
272
  "outputs": [],
273
  "source": [
274
  "# calculate tsv file size and md5sum for each tsv_files\n",
 
302
  "cell_type": "code",
303
  "execution_count": null,
304
  "id": "21",
305
+ "metadata": {
306
+ "id": "21"
307
+ },
308
  "outputs": [],
309
  "source": [
310
  "%ls -l $sdm_dir/tsv_metadata"
 
314
  "cell_type": "code",
315
  "execution_count": null,
316
  "id": "22",
317
+ "metadata": {
318
+ "id": "22"
319
+ },
320
  "outputs": [],
321
  "source": [
322
  "tsv_metadata"
 
326
  "cell_type": "code",
327
  "execution_count": null,
328
  "id": "23",
329
+ "metadata": {
330
+ "id": "23"
331
+ },
332
  "outputs": [],
333
  "source": [
334
  "pfb_data = os.path.join(sdm_dir, Path(out_file).stem + \"_data.avro\")\n",
 
340
  {
341
  "cell_type": "markdown",
342
  "id": "24",
343
+ "metadata": {
344
+ "id": "24"
345
+ },
346
  "source": [
347
  "PFB contains a utility to convert from the serialized format to more readable and workable files, including TSVs. Here we demonstrate that utility:"
348
  ]
 
351
  "cell_type": "code",
352
  "execution_count": null,
353
  "id": "25",
354
+ "metadata": {
355
+ "id": "25"
356
+ },
357
  "outputs": [],
358
  "source": [
359
  "!gen3 pfb to -i $pfb_data tsv # convert the PFB file to TSV format"
 
363
  "cell_type": "code",
364
  "execution_count": null,
365
  "id": "26",
366
+ "metadata": {
367
+ "id": "26"
368
+ },
369
  "outputs": [],
370
  "source": [
371
  "!gen3 pfb show -i $pfb_data # show the contents of the PFB file"
 
375
  "cell_type": "code",
376
  "execution_count": null,
377
  "id": "27",
378
+ "metadata": {
379
+ "id": "27"
380
+ },
381
  "outputs": [],
382
  "source": [
383
  "!gen3 pfb show -i $pfb_data schema | jq # show the schema of the PFB file"
 
386
  {
387
  "cell_type": "markdown",
388
  "id": "28",
389
+ "metadata": {
390
+ "id": "28"
391
+ },
392
  "source": [
393
  "Now we've gone all the way from a dump of data contribution files, to a simple structured data model, to a serialized PFB, and back to usable files!"
394
  ]
 
396
  {
397
  "cell_type": "markdown",
398
  "id": "29",
399
+ "metadata": {
400
+ "id": "29"
401
+ },
402
  "source": []
403
  }
404
  ],
405
  "metadata": {
406
+ "colab": {
407
+ "provenance": [],
408
+ "toc_visible": true
409
+ },
410
  "kernelspec": {
411
  "display_name": "Python 3",
412
  "language": "python",