Commit 61deef0
Parent(s): 370837e
all data views in

web.py CHANGED

@@ -9,7 +9,7 @@ from data.url_blocklist import urls_high_matches, urls_false_positives
 from data.non_web_urls import non_web_urls


-def view_data_static(
+def DVS(
     left,
     header,
 ):

@@ -28,7 +28,7 @@ def view_data_static(
     return Div(H3(header), data_display, style="margin-top: 10px;")


-def view_data(
+def DV(
     left_file,
     doc_id,
     header,

@@ -79,7 +79,7 @@ def view_data(
     return Div(form, data_display, style="margin-top: 10px;", id=target)


-def
+def DV2(
     left_file,
     right_file,
     doc_id,

@@ -149,7 +149,7 @@ def update(target: str, request):
     right_file = params.get("right_file")
     if left_file and right_file:
         return (
-
+            DV2(
                 left_file,
                 right_file,
                 doc_id,

@@ -157,7 +157,7 @@ def update(target: str, request):
             ),
         )
     else:
-        return
+        return DV(
             left_file,
             doc_id,
             params.get("header"),

@@ -206,18 +206,18 @@ def web_data():
         we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
         Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
         """),
-
+        DV2("data/sample_wet.json", "data/sample_warc.json", 3),
         H4("1.2 Language Identification"),
         P("""
         After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
         This step removes over 60% of the whole data.
         """),
-
+        DV(
             "data/sample_non_en.json",
             3,
             "Sample documents that are classified as non-English",
         ),
-
+        DV(
             "data/sample_en_low.json",
             3,
             "Sample documents that are classified as English but with score less than 0.65",
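
A rough sketch of the two steps described in this hunk: text extraction from raw WARC files with warcio plus trafilatura, and fastText language identification with the 0.65 threshold quoted in the text. The WARC path, the lid.176.bin model path, and the function name are illustrative assumptions, not code from web.py.

    import fasttext
    import trafilatura
    from warcio.archiveiterator import ArchiveIterator

    LID_THRESHOLD = 0.65  # threshold quoted in the text
    lid_model = fasttext.load_model("lid.176.bin")  # assumed local copy of the fastText LID model

    def extract_english_docs(warc_path):
        """Yield extracted English text from the response records of one WARC file."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                html = record.content_stream().read().decode("utf-8", errors="ignore")
                text = trafilatura.extract(html)  # strips navigation menus, ads, and other boilerplate
                if not text:
                    continue
                labels, scores = lid_model.predict(text.replace("\n", " "))  # predict() expects a single line
                if labels[0] == "__label__en" and scores[0] >= LID_THRESHOLD:
                    yield text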

@@ -233,14 +233,12 @@ def web_data():
         articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
         4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
         """),
-
+        DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
         P("""
         We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
         """),
-
-
-        ),
-        view_data(
+        DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
+        DV(
             "data/bad_url_doc.jsonl",
             3,
             "Sample documents whose urls are blocked by the refined url blocklist",
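
The UT1 matching described above amounts to comparing URL hosts against a set of blocked domain names. A simplified sketch, assuming the refined blocklist is a plain set of bare domains; the suffix-matching rule and the helper name are illustrative.

    from urllib.parse import urlparse

    def is_blocked(url: str, blocked_domains: set) -> bool:
        """True if the URL's host is a blocked domain or a subdomain of one."""
        host = urlparse(url).netloc.lower().split(":")[0]
        parts = host.split(".")
        # check the host and every parent domain, e.g. ads.example.com -> example.com
        return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))

    # is_blocked("https://ads.example.com/page", {"example.com"})  -> True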

@@ -249,11 +247,11 @@ def web_data():
         P("""
         To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
         """),
-
+        DVS(
             non_web_urls,
             "curated url domains that are excluded from our dataset",
         ),
-
+        DV(
             "data/sample_url_exclusion.json",
             0,
             "Sample documents whose urls are in our curated url domain list",

@@ -272,7 +270,7 @@ def web_data():
         of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
         documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
         """),
-
+        DV(
             "data/sample_terminal_punc.json",
             0,
             "Sample documents with lines that are removed by the rule of terminal punctuation",
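
For context, the terminal-punctuation rule this hunk reports evaluating (and ultimately not adopting) can be written as a one-line filter; the punctuation set and function name are assumptions based on the C4 description.

    TERMINAL_PUNCT = (".", "!", "?", '"', "”")

    def keep_terminal_punct_lines(text: str) -> str:
        """C4-style rule: keep only lines that end with a terminal punctuation mark."""
        return "\n".join(line for line in text.split("\n") if line.rstrip().endswith(TERMINAL_PUNCT))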

@@ -285,7 +283,7 @@ def web_data():
         propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
         The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
         """),
-
+        DV(
             "data/sample_java.jsonl",
             0,
             "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
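
The refined javascript rule described above only drops a line when "javascript" co-occurs with one of the listed keywords. A minimal sketch with an illustrative function name:

    JS_KEYWORDS = ("enable", "disable", "require", "activate", "browser")

    def is_js_boilerplate_line(line: str) -> bool:
        """Drop the line only if it mentions 'javascript' together with one of the extra keywords."""
        lower = line.lower()
        return "javascript" in lower and any(kw in lower for kw in JS_KEYWORDS)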

@@ -298,7 +296,7 @@ def web_data():
         - The line matches the pattern “r'^\\d+\\s+likes$'”,
         - The line contains only one word.
         """),
-
+        DV(
             "data/sample_refinedweb_line.json",
             0,
             "Sample documents with lines that are removed by the RefinedWeb rules",
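
The two RefinedWeb-style line rules quoted in this hunk can be sketched directly from the text; the helper name is illustrative, and the other RefinedWeb rules are omitted.

    import re

    LIKES_PATTERN = re.compile(r"^\d+\s+likes$")

    def is_refinedweb_noise_line(line: str) -> bool:
        """Drop counter lines such as '3 likes' and lines that contain only one word."""
        stripped = line.strip()
        return bool(LIKES_PATTERN.match(stripped)) or len(stripped.split()) == 1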

@@ -311,7 +309,7 @@ def web_data():
         line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
         the bad words from English but also consider the bad words from other languages.
         """),
-
+        DVS(
             json.load(open("data/toxic_lines.json")),
             "Sample documents with toxic lines",
         ),
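
A sketch of the toxic-line rule described above, assuming the multilingual bad-word lists are loaded into a single set; the position window (first or last 3 lines) follows the text, the rest is illustrative.

    def remove_toxic_edge_lines(lines: list, bad_words: set) -> list:
        """Drop a line if it contains a bad word and sits in the first 3 or last 3 lines."""
        kept = []
        for i, line in enumerate(lines):
            near_edge = i < 3 or i >= len(lines) - 3
            has_bad_word = any(word.lower() in bad_words for word in line.split())
            if near_edge and has_bad_word:
                continue
            kept.append(line)
        return kept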

@@ -319,7 +317,7 @@ def web_data():
         P("""
         In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
         Overview of all the quality signals that are used for filtering."""),
-
+        DVS(
             json.load(open("data/all_signals.json")),
             "Overview of all the quality signals that are used for filtering",
         ),

@@ -368,9 +366,10 @@ def web_data():
         ensures consistency with the overall document character count calculation.
         """),
         H5("Our Implementation"),
-
-
-
+        DV(
+            "data/repeat_line_frac.jsonl",
+            0,
+            "Sample documents filtered by excessive line repetitions / characters in repeated lines",
         ),
         H5("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
         P("""
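
The signal added here ("excessive line repetitions / characters in repeated lines") is, in one common variant, the fraction of characters that fall in lines occurring more than once. A sketch under that assumption; the exact occurrence-counting convention is the one discussed in the surrounding section.

    from collections import Counter

    def dup_line_char_fraction(text: str) -> float:
        """Fraction of characters in lines that occur more than once (counting every copy)."""
        lines = [line for line in text.split("\n") if line.strip()]
        total_chars = sum(len(line) for line in lines)
        if total_chars == 0:
            return 0.0
        counts = Counter(lines)
        dup_chars = sum(len(line) for line in lines if counts[line] > 1)
        return dup_chars / total_chars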

@@ -394,9 +393,10 @@ def web_data():
         only once — tend to be short.
         """),
         H5("Our Implementations"),
-
-
-
+        DV(
+            "data/sample_top_ngram.json",
+            0,
+            "Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
         ),
         H5("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
         P("""
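
The signal in 3.1.2 measures how many characters are covered by the single most frequent word n-gram (n=2,3,4). A minimal sketch; counting only word characters (no whitespace) is an assumption.

    from collections import Counter

    def top_ngram_char_fraction(words: list, n: int) -> float:
        """Fraction of word characters covered by the most common word n-gram."""
        ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total_chars = sum(len(w) for w in words)
        if not ngrams or total_chars == 0:
            return 0.0
        ngram, count = Counter(ngrams).most_common(1)[0]
        return count * sum(len(w) for w in ngram) / total_chars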

@@ -423,18 +423,15 @@ def web_data():
         We decided to use the RedPajama V2 implementation but skip the 1st occurrence of the duplicate n-gram.
         """),
         H5("Our Implementations"),
-        Img(
-            src="path/to/sample_dup_ngrams.png",
-            alt="Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
-        ),
         H5("An Example to Show the Difference Between Above Implementations"),
         P("..."), # Add specific examples if available
         H5(
             "Sample Documents Filtered by the Fraction of Characters in Duplicated N-grams (n=5,...,10)"
         ),
-
-
-
+        DV(
+            "data/sample_dup_ngram.json",
+            0,
+            "Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
         ),
         H4("3.2 Line-wise Heuristics"),
         P("""
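
One way to read "use the RedPajama V2 implementation but skip the 1st occurrence" is to mark only the repeat occurrences of each n-gram and measure the characters they cover. A sketch under that reading, for a single n:

    def dup_ngram_char_fraction(words: list, n: int) -> float:
        """Fraction of word characters inside repeated n-grams, ignoring each n-gram's first occurrence."""
        total_chars = sum(len(w) for w in words)
        if total_chars == 0 or len(words) < n:
            return 0.0
        seen = set()
        duplicated = [False] * len(words)
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i:i + n])
            if ngram in seen:
                for j in range(i, i + n):  # only repeat occurrences mark their words
                    duplicated[j] = True
            else:
                seen.add(ngram)
        return sum(len(w) for w, d in zip(words, duplicated) if d) / total_chars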

@@ -443,9 +440,10 @@ def web_data():
         works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
         90% of lines start with a bullet point.
         """),
-
-
-
+        DV(
+            "data/line_info.json",
+            0,
+            "Sample documents that are filtered out by line-wise heuristics",
         ),
         H4("3.3 Statistics-based Heuristics"),
         P("""
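
The two document-level thresholds in 3.2 translate directly into a check like the following; the bullet and ellipsis character sets are assumptions.

    def fails_line_heuristics(lines: list) -> bool:
        """Drop the document if >30% of lines end with an ellipsis or >90% start with a bullet."""
        if not lines:
            return False
        ellipsis_frac = sum(l.rstrip().endswith(("...", "…")) for l in lines) / len(lines)
        bullet_frac = sum(l.lstrip().startswith(("-", "*", "•")) for l in lines) / len(lines)
        return ellipsis_frac > 0.30 or bullet_frac > 0.90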

@@ -505,10 +503,6 @@ median_word_length = median(len(word) for word in words)
         The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
         to split text into sentences.
         """),
-        Img(
-            src="path/to/sample_sentences_split.png",
-            alt="Sample documents split into sentences",
-        ),
         P("""
         However, we found that this approach can mistakenly interpret periods in URLs as sentence endings. To address this,
         we opted to use `nltk.tokenize.sent_tokenize` for more accurate sentence splitting.
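
The switch away from regex splitting is motivated in the text above: periods inside URLs should not count as sentence boundaries. A small usage sketch of nltk.tokenize.sent_tokenize (the example sentence is made up):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)  # Punkt models used by sent_tokenize

    text = "See https://example.com/a.b for details. This is the second sentence."
    print(sent_tokenize(text))
    # Punkt splits at sentence-final punctuation followed by whitespace,
    # so the periods inside the URL are not treated as sentence endings.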

@@ -522,10 +516,6 @@ median_word_length = median(len(word) for word in words)
         Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
         We calculate the ratio as the number of symbols divided by the total number of words.
         """),
-        Img(
-            src="path/to/sample_symbol_word_ratio.png",
-            alt="Sample documents filtered by symbol-to-word ratio",
-        ),
         H5("Fraction of Alphabetic Words"),
         P("""
         Implementations from Dolma
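
The symbol-to-word ratio described here counts occurrences of "#", "..." and "…" against the word count. A sketch; how overlapping symbols are counted may differ from the RedPajama-V2 and DataTrove implementations.

    SYMBOLS = ("#", "...", "…")

    def symbol_to_word_ratio(text: str) -> float:
        """Number of symbol occurrences divided by the total number of words."""
        words = text.split()
        if not words:
            return 0.0
        return sum(text.count(symbol) for symbol in SYMBOLS) / len(words)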

@@ -549,19 +539,17 @@ median_word_length = median(len(word) for word in words)
             alt="Sample documents filtered by number of stop words",
         ),
         H5("Our Implementations"),
-
-
-
+        DV(
+            "data/sample_doc_stat.json",
+            0,
+            "Sample documents that are filtered out by statistics-based heuristics",
         ),
         H4("3.4 Others"),
         P("""
         Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
         text.
         """),
-        Img(
-            src="path/to/sample_lorem_ipsum.png",
-            alt="Sample documents containing 'lorem ipsum'",
-        ),
+        DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum"),
         H3("4. Deduplication"),
         P("..."), # Add detailed content and images as needed
         H3("5. PII Removal"),
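
Two of the checks referenced in this last hunk are simple enough to sketch directly: the median word length statistic shown in the hunk context and the C4 lorem-ipsum page filter; the helper names are illustrative.

    from statistics import median

    def median_word_length(text: str) -> float:
        words = text.split()
        return median(len(word) for word in words) if words else 0.0

    def contains_lorem_ipsum(text: str) -> bool:
        """C4-style rule: flag any page containing the placeholder phrase."""
        return "lorem ipsum" in text.lower()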