Series

This is an article in a series about Uzushio: Japanese Corpus Preprocessing Tool.

  1. Uzushio: a preprocessing tool for LLM pretraining corpora
  2. Uzushio: Text Extraction
  3. Uzushio: Filters

Uzushio: Filters

Uzushio filters form a filter chain.
The chain can contain several filters, and each filter is applied to a document in order.
A filter can transform a document object, for example by marking it for deletion or modifying it in another way.
Most of the filters described in this article only mark a whole document or several of its paragraphs for deletion.
Filters are applied one by one, in declaration order.
Once a document is rejected by a filter, the downstream filters in the chain are not applied to it.

The filtering task is implemented, as usual, as an Apache Spark job and has two output modes:

  • Output only documents and their paragraphs which were not filtered out,
  • Output all documents, annotating each filtered paragraph or document with the filter that rejected it

The second mode is useful for debugging and for building filter chains that output data of several quality gradations.
For example, the best data would be documents that pass all filters, the second-best data would be documents rejected only by the last filter, and so on.

Filtering Framework

Filters are configured using the HOCON configuration language.
We show examples mostly in JSON as it is a subset of HOCON.
Filters are defined as an array as shown in the example below.

"filters": [
    {"class": "Filter1"},
    {"class": "Filter2", "param1": 5},
]

Each filter is defined by its class.
Class names can be either fully qualified, or the default filter package com.worksap.nlp.uzushio.lib.filters can be omitted.
Filters can have parameters; we describe the available parameters for each filter below.

Some filters utilize pseudorandom numbers.
However, to make Uzushio produce deterministic results, pseudorandom numbers are generated based on a seed derived from document ids.
This means that no matter the partitioning and processing order, each document will always get a deterministic random number stream.
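
For illustration, a minimal sketch of deriving a deterministic random stream from a document id is shown below; the helper name and hashing choice are assumptions rather than Uzushio's actual implementation.

import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical helper: hash the document id into a seed, so the same document
// always receives the same pseudorandom stream regardless of partitioning or order.
def rngForDocument(docId: String): Random =
  new Random(MurmurHash3.stringHash(docId).toLong)

A probabilistic filter can then draw all of its random decisions for a document from such a generator.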

Duplicate Documents Subsampling

The main objective of this filter is to perform corpus deduplication.
For this, we estimate the duplication count of a document from its paragraph near-duplicate frequencies and then subsample documents so that the expected duplication count equals the provided value.

We estimate the duplication count of a document using percentiles, not averages, because percentiles are more robust to outliers and most documents contain at least some extremely frequent paragraphs.
We use a low, but not extremely low, percentile to ignore false negatives which can occur during the near-duplicate detection step.

Example

{"class": "DeduplicateDocumentsPercentile", "expected": 5, "percentile": 0.05}

This filter will probabilistically remove documents whose paragraph near-duplicate frequency at the 5th percentile is greater than 5.
Such documents will be subsampled so that the expected number of occurrences of the offending paragraphs in the corpus is 5.

Parameters:

  • expected: (float, default=1), subsample documents so that the expected document duplication count equals this value
  • percentile: (float, default=0.05), use this percentile to estimate document duplication count
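
As a rough sketch of the idea (with hypothetical helper names, not the actual Uzushio code): the duplication count of a document is estimated as a low percentile of its paragraph near-duplicate frequencies, and the document is then kept with probability expected / estimate.

import scala.util.Random

// Hypothetical sketch of percentile-based duplicate subsampling.
def acceptDocument(paragraphFreqs: Seq[Long], expected: Double,
                   percentile: Double, rng: Random): Boolean = {
  if (paragraphFreqs.isEmpty) true
  else {
    val sorted = paragraphFreqs.sorted
    // duplication estimate: paragraph near-duplicate frequency at the given percentile
    val index = math.min(sorted.length - 1, (percentile * sorted.length).toInt)
    val estimate = sorted(index).toDouble
    // keep documents that are not over-duplicated; otherwise keep with probability
    // expected / estimate so that the expected number of surviving copies is `expected`
    estimate <= expected || rng.nextDouble() < expected / estimate
  }
}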

High-Frequency Subsequent Paragraph Trimming

This filter also uses the paragraph near-duplicate frequency statistics, but applies them more directly.
We trim paragraphs if multiple successive ones have a frequency higher than a specified threshold.

The goal of this filter is to remove text that is present on a large number of pages, like navigation or advertisements,
but leave in place the main content, which should be less duplicated over the corpus.
As a safety mechanism to prevent over-deletion of text, this filter removes paragraphs only if a specified number of successive paragraphs have high near-duplicate frequencies.

Example

{"class": "LargeFreqParagraphs", "count": 3, "freq": 100}

This filter will remove paragraphs from a document if more than 3 successive paragraphs have a near-duplicate frequency larger than 100.
The rest of the paragraphs remain in the document.

Parameters:

  • freq: (int, default=100), the paragraph near-duplicate frequency threshold above which a paragraph is considered for removal
  • count: (int, default=3), remove paragraphs only if count successive paragraphs have a frequency greater than the threshold. Each document is implicitly surrounded by an unlimited number of paragraphs with unlimited frequency.
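
A sketch of the trimming logic under an assumed representation (each paragraph paired with its near-duplicate frequency) is shown below; the boundary handling follows one reading of the implicit-surrounding rule above, and this is not the actual Uzushio code.

// Hypothetical sketch: drop runs of `count` or more high-frequency paragraphs.
def trimFrequentRuns(paragraphs: Vector[(String, Long)],
                     freq: Long, count: Int): Vector[(String, Long)] = {
  val keep = Array.fill(paragraphs.length)(true)
  var runStart = 0
  var i = 0
  while (i <= paragraphs.length) {
    val over = i < paragraphs.length && paragraphs(i)._2 > freq
    if (!over) {
      // a run of high-frequency paragraphs [runStart, i) has ended
      val touchesEdge = runStart == 0 || i == paragraphs.length
      if (i - runStart > 0 && (i - runStart >= count || touchesEdge))
        (runStart until i).foreach(j => keep(j) = false)
      runStart = i + 1
    }
    i += 1
  }
  paragraphs.indices.collect { case j if keep(j) => paragraphs(j) }.toVector
}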

Example Filtered Documents

The following two files contain extracts of documents filtered with the specified parameters.
Paragraph frequencies were estimated on the whole Common Crawl corpus.

Both of the examples contain frequent strings which appear on multiple pages at once.
Most of the content is navigation-like data, headers, or footers.
Such content is not useful for training LLMs or foundational models.

Document Compression Rate

One good filtering idea, suggested by Oda Yusuke, is based on the observation that, especially in web corpora, texts with different characteristics have different compression rates.
Namely, we compute the ratio of compressed text size to uncompressed text size and filter out documents whose compression rate is lower or higher than the provided thresholds.
We use the LZ4 algorithm to compress document text data as it has very fast compression speeds.

Documents with a low compression ratio are mostly low-quality pages with copied content and are not useful for LLM training.
Pages with very high compression ratios mostly contain various lists.

Example

{"class": "CompressionRate", "low": 0.40, "high": 0.75}

This filter will remove documents that have an LZ4 compression rate (compressed size divided by uncompressed size) below 0.4 or above 0.75.

Parameters:

  • low: (float, default=0), documents having an LZ4 compression rate below this number will be removed
  • high: (float, default=1), documents having an LZ4 compression rate above this number will be removed
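
For illustration, the compression rate can be computed with the lz4-java library as sketched below; whether Uzushio uses this exact library and helper is an assumption.

import java.nio.charset.StandardCharsets
import net.jpountz.lz4.LZ4Factory

// Hypothetical helper: LZ4 compression rate = compressed size / uncompressed size.
def compressionRate(text: String): Double = {
  val raw = text.getBytes(StandardCharsets.UTF_8)
  if (raw.isEmpty) 1.0
  else {
    val compressed = LZ4Factory.fastestInstance().fastCompressor().compress(raw)
    compressed.length.toDouble / raw.length
  }
}

A document is then kept when low <= compressionRate(text) <= high.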

Example Filtered Documents

In the first group there are documents with many repetitions, which have very low compression rates.

In the second group there are documents with very high compression rates, containing list-like content.

Hiragana Ratio

Another good heuristic to distinguish low-quality Japanese documents from other ones is the ratio of hiragana characters in text.
Regular text contains a substantial amount of hiragana, while lists, advertisements, and other low-quality text contain a very low percentage of it.
For this filter, we measure the ratio of hiragana characters to the total number of characters in a document.

Example

{"class": "HiraganaRatio", "low": 0.1, "high": 1.0}

This filter will remove documents that have a hiragana ratio below 0.1 or above 1.0.

Parameters:

  • low: (float, default=0), documents having a hiragana ratio below this number will be removed
  • high: (float, default=1), documents having a hiragana ratio above this number will be removed
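
A minimal sketch of the measurement (hypothetical helper, not the actual Uzushio code):

// Ratio of characters in the Unicode Hiragana block (U+3040..U+309F) to all characters.
def hiraganaRatio(text: String): Double = {
  if (text.isEmpty) 0.0
  else {
    val hiragana = text.count(c => Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HIRAGANA)
    hiragana.toDouble / text.length
  }
}

A document is then kept when low <= hiraganaRatio(text) <= high.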

Example Filtered Documents

As the hiragana ratio increases, documents become more Japanese-like, but all of the ranges still contain a lot of low-quality documents.

Document Link Ratio

The percentage of link text is another useful criterion for distinguishing low-quality documents.
The text extraction step of Uzushio records the spans of text that were link bodies in original documents.
We use that information to compute the link ratio: the number of characters that were inside links divided by the total number of characters in the document.

Documents that contain many links are mostly low-quality advertisement spaces or link farms, while regular pages do not contain that many links.
This filter helps us to remove such pages.

Example

{"class": "LinkCharRatio", "low": 0, "high": 0.8}

This filter will remove documents that have a link ratio above 0.8.

Parameters:

  • low: (float, default=0), documents having a link ratio below this number will be removed
  • high: (float, default=1), documents having a link ratio above this number will be removed
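
A sketch under an assumed representation (the actual span data structures in Uzushio may differ): each paragraph carries the character spans that were link bodies in the original HTML.

// Hypothetical representation of link spans recorded during text extraction.
final case class LinkSpan(start: Int, end: Int)
final case class Paragraph(text: String, linkSpans: Seq[LinkSpan])

def linkCharRatio(paragraphs: Seq[Paragraph]): Double = {
  val totalChars = paragraphs.map(_.text.length).sum
  if (totalChars == 0) 0.0
  else {
    val linkChars = paragraphs.map(p => p.linkSpans.map(s => s.end - s.start).sum).sum
    linkChars.toDouble / totalChars
  }
}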

Example Filtered Documents

Word List Filtering

Documents can contain bad words which are not suitable for training LLMs.
For example, the web contains a high number of adult-related sites.
Such documents are usually removed using word lists.

We provide two types of such filters: one counts all instances of words, the other counts only the types (distinct words) that are included.
Both implementations internally build a trie representation of the word list and can handle large word lists efficiently.
However, we do not perform any additional checks during word detection, so any substring inclusion is counted as a word.
In contrast, HojiChar checks that the character category changes at the detected word boundaries.

For corpus filtering, we adopted word lists developed by the HojiChar project, removing words that caused a lot of false positives.

Word Types Filter Example

 {"class": "WordTypes", "threshold": 9, "kind": "uniq", "list": "hojichar/adult_keywords_ja.txt"}

This filter will remove all documents that contain more than 9 distinct words from the specified word list.
Even if a word is included multiple times, it is counted only once.

Parameters:

  • list: (string, required), the path to a word list, can be either on the classpath or on a filesystem accessible to all executors
  • threshold: (float, default=3), if the total score exceeds this number the document is deleted
  • kind: (string, default=uniq), sets word scoring mode, described below

The filter supports different word scoring modes:

  • uniq counts each word once, no matter the number of times the word was used
  • sqrt counts each word as the square root of the actual number of times the word was used. For example, if the first word was used 4 times and the second 9 times, the total score to compare against the threshold would be sqrt(4) + sqrt(9) = 2 + 3 = 5.
  • log10 counts each word as 1 + log_10(freq) where freq is the original frequency of the word. It gives a larger discount to the word counts than the sqrt mode, but not as much as the uniq mode.
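
The scoring modes can be summarized with the following sketch, given a map from each detected word to its occurrence count in a document (the helper name is hypothetical):

// Hypothetical sketch of the uniq / sqrt / log10 scoring modes.
def wordScore(counts: Map[String, Int], kind: String): Double = kind match {
  case "uniq"  => counts.size.toDouble                           // each distinct word counts as 1
  case "sqrt"  => counts.values.map(c => math.sqrt(c)).sum       // square root of each word's count
  case "log10" => counts.values.map(c => 1.0 + math.log10(c)).sum
  case other   => throw new IllegalArgumentException(s"unknown kind: $other")
}

A document is deleted when wordScore(counts, kind) exceeds the threshold.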

Word Instances Filter Example

This filter counts each inclusion of a substring from the given word list as a separate word occurrence and increases the score based on that, independently of other words.

{"class": "WordInstances", "threshold": 10, "full": 1.0, "list": "hojichar/adult_keywords_ja.txt"}

In this configuration, documents that have 10 word occurrences (e.g. single word used 10 times, 10 words used 1 time each, or 5 words used 2 times each) are deleted.

Parameters:

  • list: (string, required), a path to a word list, can be either on the classpath or on a filesystem accessible to all executors
  • threshold: (float, default=3), if the total score exceeds this number the document is deleted
  • full: (float, default=1), set the score for each substring occurrence

Example Filtered Documents

Warning: these pages contain offensive content.

N-gram Based Language Model Filters

One of the strongest filters we implement is the n-gram language model filter.
It allows users to evaluate per-document or per-paragraph average perplexity and delete documents or paragraphs for which the evaluated perplexity is larger than the provided thresholds.
The filter uses Sudachi for tokenization and the KenLM library for the n-gram language model implementation.

We provide two variations of the filter, similarly to the filters that operate on duplication frequency: a per-document one and a per-paragraph one.
Both variations are similar in their behavior to the deduplication filters: the per-document filter evaluates a per-document metric and deletes the document based on that metric, while the per-paragraph filter evaluates a per-paragraph metric and removes paragraphs if several in a row have an average perplexity larger than the provided threshold.

In addition to that, we provide an option to ignore outliers when evaluating perplexity.
We allow users to ignore up to a certain percentage of tokens with the highest tokenwise perplexity when computing the per-paragraph average perplexity.

Per-Document Example

In the per-document filter, perplexity is estimated per paragraph internally.
The resulting per-document perplexity is a weighted sum of per-paragraph perplexities, weighted by the paragraph length.

{"class": "KenLMDocAvgPerplexity", "sudachi": ${sudachi}, "kenlm": ${kenlm}, "outliers": 0.1, "high": 1e6, "low": 5}

Parameters:

  • sudachi: (string, required), path to sudachi binary dictionary that will be used for tokenization
  • kenlm: (string, required), a path to the kenlm-compatible language model
  • outliers: (float, default=0), if > 0, then the given percentage of tokens with the highest perplexity (up to a certain limit) would not be used for estimating per-paragraph perplexity
  • high: (float, default=1e6), documents with an average perplexity higher than this number will be deleted
  • low: (float, default=0), documents with an average perplexity lower than this number will be deleted
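
A sketch of the length-weighted aggregation described above, assuming the per-paragraph perplexities are already computed and that the weighted sum is normalized by the total length (i.e. a weighted average); the helper is hypothetical.

// Hypothetical helper: per-document perplexity as a paragraph-length-weighted
// average of per-paragraph average perplexities.
def documentPerplexity(paragraphs: Seq[(String, Double)]): Double = {
  val totalChars = paragraphs.map(_._1.length).sum
  if (totalChars == 0) 0.0
  else paragraphs.map { case (text, ppl) => text.length * ppl }.sum / totalChars
}

The document is deleted when the result is above high or below low.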

Per-Paragraph Example

In the per-paragraph filter, perplexity is estimated for each paragraph independently.
Paragraphs are removed if several successive ones have an average perplexity greater than the provided threshold.

{"class": "KenLMParagraphPerplexity", "sudachi": ${sudachi}, "kenlm": ${kenlm}, "outliers": 0.1, "count": 3, "threshold": 1e6}

Parameters:

  • sudachi: (string, required), path to sudachi binary dictionary that will be used for tokenization
  • kenlm: (string, required), a path to a kenlm-compatible language model
  • outliers: (float, default=0.02), if > 0, then the given percentage of tokens with the highest perplexity (up to a certain limit) would not be used for estimating per-paragraph perplexity
  • count: (int, default=3), remove paragraphs only if count successive paragraphs have an average perplexity greater than the threshold
  • threshold: (float, default=1e6), consider paragraphs for removal if their average perplexity is greater than this number

Example Filtered Documents

Document Length Filter

Documents that are very short usually do not contain any meaningful content.
This filter removes documents that have a character length lower than the specified threshold.

Example

{"class": "DocLength", "low": 50}

This filter will remove all documents shorter than 50 characters.

Parameters:

  • low: (int, default=0), documents with a character length below this number will be removed
  • high: (int, default=INT_MAX), documents having a character length above this number will be removed

Example Filtered Documents

Navigation-like Paragraph Trimming

We apply hand-crafted rules to filter out navigation-like paragraphs from the document.
Those rules use the CSS selector data attached to each paragraph.

Example

{"class": "NoContentDOM"}

Example Filtered Documents

Markdownization

LLMs usually use Markdown-inspired syntax for their output.
Because of this, we transform some HTML elements to their Markdown analogs using the filtering framework.
Those filters have no parameters.

Examples

{"class": "MergeListTag"}

Will transform lists (<li> or <option> tags) to Markdown syntax (- entry).

{"class": "MarkdownizeHeading"}

Will transform HTML headers (e.g. <h1> tags) to their Markdown equivalent (# Headers).
Incorrect heading levels, however, are left as they are.
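
As a sketch of the kind of transformation these filters perform (assuming each paragraph carries the HTML tag it was extracted from; the helper is hypothetical):

// Hypothetical sketch: convert list items and headings to Markdown-like text.
def markdownize(tag: String, text: String): String = tag match {
  case "li" | "option"          => s"- $text"                            // list entries
  case h if h.matches("h[1-6]") => ("#" * h.drop(1).toInt) + " " + text  // headings
  case _                        => text
}

// markdownize("h2", "Filters") == "## Filters"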
