victormiller committed on
Commit 146aa07 · verified · 1 Parent(s): 466af30

Update web.py

Files changed (1)
  1. web.py +4 -1
web.py CHANGED
@@ -216,6 +216,8 @@ def web_data():
   style="margin-top: 20px;",
   ),
   H3("1. Document Preparation"),
+
+ button( Div(
   H4("1.1 Text Extraction"),
   P("""
   Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -224,7 +226,8 @@ def web_data():
   we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
   Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
   """),
- DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+ DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
+
   H4("1.2 Language Identification"),
   P("""
   After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
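
The language-identification text in the last hunk describes a confidence threshold of 0.65 on a fastText prediction. A minimal sketch of that kind of filter follows; it is not the pipeline's actual code. The function names `is_english` and `filter_english` are hypothetical, the `__label__en` label assumes a fastText language-ID model that uses that label format, and treating the threshold as inclusive (`>=`) is also an assumption. The predictor is passed in as a callable so the sketch does not depend on a particular model file.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Hypothetical alias: any callable shaped like a fastText model's predict(text),
# returning (labels, probabilities) for the top prediction(s).
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def is_english(text: str, predict: Predictor, threshold: float = 0.65) -> bool:
    """Return True if the top predicted label is English with enough confidence."""
    # fastText predicts on a single line, so newlines are flattened first.
    labels, probs = predict(text.replace("\n", " "))
    if len(labels) == 0:
        return False
    # Assumption: the English label is "__label__en" and the threshold is inclusive.
    return labels[0] == "__label__en" and float(probs[0]) >= threshold


def filter_english(
    docs: Iterable[str], predict: Predictor, threshold: float = 0.65
) -> Iterator[str]:
    """Yield only the documents that pass the English-confidence check."""
    for doc in docs:
        if is_english(doc, predict, threshold):
            yield doc
```

With the real library, `predict` could plausibly be the `predict` method of a model loaded via `fasttext.load_model(...)`; which model file the pipeline uses is not stated in this commit.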