Uzushio: a preprocessing tool for LLM pretraining corpora

Posted at 2023-12-11

Series

This is an article in a series about Uzushio: Japanese Corpus Preprocessing Tool.

  1. Uzushio: a preprocessing tool for LLM pretraining corpora
  2. Uzushio: Text Extraction
  3. Uzushio: Filters

A Japanese version of this article will be posted later.

What is Uzushio

We are open-sourcing Uzushio, an Apache Spark-based tool for preprocessing huge corpora that has been in development since this summer.

Large language models and other foundation models often require a multi-billion-token training corpus, and web data, e.g. Common Crawl, is often used as a training corpus for these models.

Because of the large scale, processing such corpora can require significant resources.
Additionally, some of the required processing stages, e.g. near-duplicate detection, can be non-trivial.

This article gives a brief introduction to Uzushio and its processing steps.
We will follow up with how-to-use articles and deep dives into the internals of the individual processing steps.

Stages

The main processing pipeline of Uzushio consists of the following stages.
Each stage is an individual Spark job.

  1. Document extraction
  2. Near-duplicate detection
  3. Duplicate statistics merging
  4. Filtering

Currently, all stages can be executed both in the cloud (e.g. AWS EMR) and in HPC environments (we provide a configuration for the ABCI system).

Document Extraction

In this stage Uzushio extracts text data from raw HTML documents.
We extract text in paragraph-like units, with paragraphs defined by HTML markup.
Namely, tags like <div> and <p> are treated as paragraph boundaries.

In addition to the text data, we also record the HTML path of each paragraph.
The path contains information which can be used in CSS selectors, e.g. body>div#content>p.text.
Finally, we also record whether each text span is a link.
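
To make this concrete, here is a minimal sketch of paragraph extraction with HTML paths, using the jsoup library; the Paragraph class and the selection logic are illustrative assumptions, not Uzushio's actual implementation.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

case class Paragraph(path: String, text: String, isLink: Boolean)

def extractParagraphs(html: String): Seq[Paragraph] = {
  val doc = Jsoup.parse(html)
  // Block-level tags are treated as paragraph boundaries.
  doc.select("p, div, li, h1, h2, h3").asScala.toSeq.flatMap { el =>
    val text = el.ownText().trim
    if (text.isEmpty) None
    else Some(Paragraph(
      // cssSelector() produces a selector-style path, e.g. "html > body > div#content > p.text"
      path = el.cssSelector(),
      text = text,
      // record whether the paragraph sits inside a link
      isLink = el.parents().asScala.exists(_.tagName() == "a")
    ))
  }
}
```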

This rich metadata helps us build better content filters.

Currently, text extraction is focused on languages with non-Latin scripts.
It would not be difficult to adapt it to English and other languages with Latin scripts,
but this is out of scope for the initial implementation, which is focused on Japanese.

Near Duplicate Detection

This stage determines whether any two paragraphs in the corpus are near duplicates.
We define paragraphs to be near duplicates when they are mostly the same sequence of characters.

The objective of the near-duplicate detection step is to compute the number of occurrences of each paragraph in the corpus, with the relaxed condition that matching paragraphs need not be exactly the same.
This step produces the duplicate statistics data: a mapping from a 64-bit paragraph hash to its occurrence count.
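
To give a flavor of how such relaxed matching can work, below is a hedged sketch that assigns each paragraph a 64-bit SimHash over character 5-grams; Uzushio's actual hashing scheme may differ.

```scala
import scala.util.hashing.MurmurHash3

// 64-bit SimHash over character n-grams: near-duplicate paragraphs get
// signatures within a small Hamming distance of each other. Grouping by the
// exact signature catches the closest duplicates; a production pipeline would
// also match signatures that differ in a few bits.
def simHash64(text: String, n: Int = 5): Long = {
  // Light normalization, so trivial whitespace edits do not change the hash.
  val s = text.replaceAll("\\s+", "")
  val counts = new Array[Int](64)
  for (gram <- s.sliding(n)) {
    // Combine two seeded 32-bit murmur hashes into one 64-bit feature hash.
    val h = (MurmurHash3.stringHash(gram, 1).toLong << 32) |
            (MurmurHash3.stringHash(gram, 2).toLong & 0xFFFFFFFFL)
    for (bit <- 0 until 64)
      counts(bit) += (if (((h >>> bit) & 1L) == 1L) 1 else -1)
  }
  counts.zipWithIndex.foldLeft(0L) { case (acc, (c, bit)) =>
    if (c > 0) acc | (1L << bit) else acc
  }
}

// With Spark, the duplicate statistics are then a simple aggregation, e.g.:
//   paragraphs.map(simHash64(_)).groupBy("value").count()
```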

Duplicate Statistics Merging

This stage merges the duplicate statistics for several corpora into a single dataset.
The current near-duplicate detection implementation is relatively scalable: it is O(N log N), where N is the number of paragraphs in the corpus.
Still, it requires significant computational resources when run on TB-scale corpora at once.
To make Uzushio more easily usable on medium-scale systems,
we make it possible to process parts of the corpus independently,
producing per-part duplicate statistics, merging them into statistics for the larger corpus,
and finally using the merged data for the final filtering.
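
The merge itself is a straightforward aggregation. Here is a minimal Spark sketch, assuming the per-part statistics are stored as (hash, count) Parquet tables; the paths and column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("merge-dup-stats").getOrCreate()

// Each part is a table of (hash: Long, count: Long) rows, e.g. in Parquet.
val parts = Seq("stats/part1", "stats/part2", "stats/part3")
  .map(spark.read.parquet(_))

// Union the per-part statistics and sum the counts per 64-bit paragraph hash.
val merged = parts.reduce(_ union _)
  .groupBy("hash")
  .agg(sum("count").as("count"))

merged.write.parquet("stats/merged")
```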

Filtering

The final stage uses the extracted text data and, optionally, the duplicate statistics to filter out low-quality documents.
Our implementation can perform the actual filtering in two ways:

  • outputting only documents which were not rejected by any filter
  • outputting all documents, grouped by the filter which rejected them

Using the filtering stage in the second mode makes it possible to mix and match documents of different quality
after the filtering process, trading off corpus quality against corpus size.

This is the current list of filters (a sketch of one of them follows the list):

  • Duplicate documents subsampling
  • High frequency subsequent paragraph trimming
  • Document compression rate
  • Hiragana rate
  • Document link ratio
  • Navigation-like paragraph trimming
  • Word types
  • Word instances
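
As an illustration of the metric-plus-threshold pattern that most of these filters follow, here is a hedged sketch of a hiragana-rate filter; the real Uzushio filter API likely differs.

```scala
// A document here is just its extracted paragraphs.
case class Document(paragraphs: Seq[String]) {
  def text: String = paragraphs.mkString("\n")
}

trait DocFilter {
  // None if the document passes; the filter name if it is rejected, so that
  // rejected documents can be grouped by the filter that removed them
  // (the second output mode described above).
  def check(doc: Document): Option[String]
}

class HiraganaRateFilter(minRate: Double) extends DocFilter {
  private def isHiragana(c: Char): Boolean = c >= '\u3041' && c <= '\u309F'

  def check(doc: Document): Option[String] = {
    val text = doc.text
    val rate =
      if (text.isEmpty) 0.0
      else text.count(isHiragana).toDouble / text.length
    // Japanese running text is rich in hiragana; a very low rate often
    // signals boilerplate, link lists, or non-Japanese content.
    if (rate < minRate) Some("hiragana-rate") else None
  }
}
```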

Debug stages

In addition to the above-mentioned stages, Uzushio also contains several Spark jobs which are used for debugging.

The most important ones are the filter-debugging jobs.
Most filters compute some metric and use thresholds to either accept or reject a document or paragraph.

These debug stages compute the metric for all documents and output the documents sorted by that metric.
This output can be used for visualization (seeing the metric distribution) and for choosing the filtering thresholds.
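
A minimal sketch of such a debug job, reusing the hypothetical hiragana-rate metric from above (the paths here are illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("debug-hiragana-rate").getOrCreate()
import spark.implicits._

def hiraganaRate(text: String): Double =
  if (text.isEmpty) 0.0
  else text.count(c => c >= '\u3041' && c <= '\u309F').toDouble / text.length

// One document per line, e.g. a dump produced by the extraction stage.
val docs: Dataset[String] = spark.read.textFile("extracted/docs.txt")

docs.map(t => (hiraganaRate(t), t))
  .toDF("metric", "text")
  .orderBy("metric") // sorted output makes it easy to eyeball a threshold
  .write.json("debug/hiragana-rate")
```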

Work in progress

While most of the work is done, some things are still in progress.
One is filtering the corpus with an n-gram language model.
We plan to add two filters: one for per-document perplexity and one for per-paragraph perplexity.
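
Both planned filters would threshold the same quantity. Here is a small sketch in which logProb stands in for a real n-gram LM scorer; it is a hypothetical helper, not an existing Uzushio API.

```scala
// Perplexity is exp(-mean log-probability) of the tokens under the model.
def perplexity(tokens: Seq[String], logProb: String => Double): Double =
  if (tokens.isEmpty) Double.PositiveInfinity
  else math.exp(-tokens.map(logProb).sum / tokens.size)

// The per-document filter would threshold the perplexity of all tokens of a
// document at once; the per-paragraph variant scores each paragraph separately,
// dropping only the high-perplexity paragraphs instead of the whole document.
```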

Currently, Uzushio is focused on Japanese and does not provide a rich set of tools for other languages.
We hope that the situation will improve in the future.

Conclusion

We hope that Uzushio will be useful for the Japanese and the wider NLP communities.
