Series
This is an article in a series about Uzushio: Japanese Corpus Preprocessing Tool.
A Japanese version of this article will be posted later.
What is Uzushio
We are open-sourcing Uzushio, an Apache Spark-based tool for preprocessing huge corpora, which has been in development since the summer of this year.
Large language models and other foundation models often require a multi-billion-token training corpus, and web data, e.g. Common Crawl, is often used as a training corpus for these models.
Because of the large scale, processing such corpora can require significant resources.
Additionally, some of the required processing stages, e.g. near-duplicate detection, can be non-trivial.
This article gives a brief introduction to Uzushio and its processing steps.
We will follow up with how-to-use articles and deep dives into the internals of the individual processing steps.
Stages
The main processing pipeline of Uzushio consists of the following stages.
Each stage is an individual Spark job.
- Document extraction
- Near-duplicate detection
- Duplicate statistics merging
- Filtering
Currently, all stages can be executed both in the cloud (e.g. on AWS EMR) and in HPC environments (we provide a configuration for the ABCI system).
Document Extraction
In this stage, Uzushio extracts text data from raw HTML documents.
We extract text in paragraph-like units, with paragraphs defined by the HTML markup.
Namely, tags like `<div>` and `<p>` are treated as paragraph boundaries.
In addition to the text data, we also record the HTML path of each paragraph.
The path contains information which can be used in CSS selectors, e.g. `body>div#content>p.text`.
Finally, we also record whether a text span is a link.
This rich metadata helps us build better content filters.
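To make this concrete, here is a minimal sketch of what such a per-paragraph record could look like; the type and field names are our illustration, not Uzushio's actual schema.

```scala
// Hypothetical sketch of the per-paragraph record produced by document
// extraction; field names are illustrative, not Uzushio's actual schema.
case class Paragraph(
  text: String,    // extracted text of the paragraph
  cssPath: String, // CSS-selector-like HTML path, e.g. "body>div#content>p.text"
  isLink: Boolean  // whether this text span comes from a link
)
```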
Currently, text extraction is focused on languages with non-Latin scripts.
It would not be difficult to adapt it to English and other languages written in Latin scripts,
but this is out of scope for the initial implementation, which focuses on Japanese.
Near Duplicate Detection
This stage finds whether any two paragraphs in the corpus are near duplicates.
By near duplicates we mean paragraphs which consist of mostly the same sequences of characters.
The objective of the near-duplicate detection step is to compute the number of occurrences of each paragraph in the corpus, under the relaxed condition that paragraphs need not be exactly the same.
This step produces the duplicate statistics data: a mapping from a 64-bit paragraph hash to its occurrence count.
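As a rough illustration of the data shape, the statistics could be produced by a Spark aggregation like the sketch below. The input path, column name, and the use of `xxhash64` over raw text are all assumptions; the real implementation must hash a normalized representation of each paragraph so that near-identical paragraphs map to the same 64-bit hash.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, xxhash64}

// Hypothetical sketch: aggregate per-paragraph occurrence counts keyed by a
// 64-bit hash. Hashing raw text here is a simplification; near-duplicate
// detection needs the hash to collide for nearly identical paragraphs.
object DupStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dup-stats").getOrCreate()
    spark.read.parquet("paragraphs.parquet")    // assumed column: "text"
      .select(xxhash64(col("text")).as("hash")) // 64-bit hash per paragraph
      .groupBy("hash")
      .agg(count(lit(1)).as("count"))           // occurrences of each hash
      .write.parquet("dup-stats.parquet")
  }
}
```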
Duplicate Statistics Merging
This stage merges the duplicate statistics for several corpora into a single dataset.
The current near-duplicate detection implementation is relatively scalable: it is O(N log N), where N is the number of paragraphs in the corpus.
Still, it requires significant computational resources when run on TB-scale corpora all at once.
To make Uzushio more easily usable on medium-scale systems,
we make it possible to process parts of the corpus independently,
producing per-part duplicate statistics, merging them into statistics for the larger corpus,
and finally using the merged data for the final filtering.
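Under the hash-to-count layout sketched above, merging per-part statistics amounts to summing the counts per hash. A minimal sketch, with illustrative paths and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical sketch: merge per-part duplicate statistics by summing the
// occurrence counts of each paragraph hash. Paths are illustrative.
object MergeDupStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-dup-stats").getOrCreate()
    spark.read.parquet("stats-part-0.parquet", "stats-part-1.parquet")
      .groupBy("hash")
      .agg(sum("count").as("count"))
      .write.parquet("stats-merged.parquet")
  }
}
```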
Filtering
The final stage uses the extracted text data and, optionally, the duplicate statistics to filter out low-quality documents.
Our implementation can perform the actual filtering in two ways:
- outputting only documents which were not rejected by any filter
- outputting all documents, grouping them by the filter they were rejected by
Using the filtering stage in the second mode allows mixing and matching documents of different quality after the filtering process, trading off corpus quality against corpus size.
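One way the second mode could be realized is to tag each document with the name of the filter that rejected it and partition the output by that tag. The following is a sketch under that assumption; the `rejectedBy` column and the paths are our invention, not Uzushio's actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Hypothetical sketch of the second output mode: keep all documents, tag each
// with the filter that rejected it (or "accepted"), and partition the output
// by that tag so lower-quality subsets can be mixed back in later.
object GroupByRejectingFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-output").getOrCreate()
    // Assumed input: documents annotated with a nullable "rejectedBy" column
    // naming the filter that rejected each document (null means accepted).
    spark.read.parquet("documents-with-filter-results.parquet")
      .withColumn("rejectedBy", coalesce(col("rejectedBy"), lit("accepted")))
      .write
      .partitionBy("rejectedBy")
      .parquet("filtered-output")
  }
}
```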
This is the current list of filters:
- Duplicate documents subsampling
- High frequency subsequent paragraph trimming
- Document compression rate
- Hiragana rate (see the sketch after this list)
- Document link ratio
- Navigation-like paragraph trimming
- Word types
- Word instances
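As an example of how simple some of these metrics are, below is a sketch of a hiragana-rate filter: text containing very little hiragana is unlikely to be natural Japanese prose. The character-range definition and the threshold value are illustrative, not Uzushio's actual parameters.

```scala
// Hypothetical sketch of a hiragana-rate filter; the threshold is illustrative.
object HiraganaRate {
  // Fraction of characters in the hiragana Unicode block (U+3041..U+309F).
  def hiraganaRate(text: String): Double =
    if (text.isEmpty) 0.0
    else text.count(ch => ch >= '\u3041' && ch <= '\u309F').toDouble / text.length

  def accept(text: String, threshold: Double = 0.1): Boolean =
    hiraganaRate(text) >= threshold

  def main(args: Array[String]): Unit = {
    println(accept("これはテストです"))     // mostly hiragana -> true
    println(accept("THIS IS A TEST 123")) // no hiragana -> false
  }
}
```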
Debug stages
In addition to the above-mentioned stages, Uzushio also contains several Spark jobs which are used for debugging.
The most important ones are the filter-debugging jobs.
Most filters compute some metric and use thresholds to either accept or reject a document or paragraph.
These debug stages compute the metric for all documents and output the documents sorted by that metric.
The output can be used for visualization (seeing the metric distribution) and for choosing the filtering thresholds.
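Conceptually, such a debug job is a small Spark program: compute the metric, sort, write out. Here is a sketch using the hiragana rate from the previous example as the metric; the paths and column names are assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical sketch of a filter-debug job: compute a metric for every
// document and write the documents sorted by it, so the distribution can be
// inspected and a threshold chosen. Paths and column names are illustrative.
object DebugHiraganaRate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("debug-hiragana-rate").getOrCreate()
    // Metric UDF: hiragana rate, as in the filter sketch above.
    val metric = udf { (text: String) =>
      if (text == null || text.isEmpty) 0.0
      else text.count(ch => ch >= '\u3041' && ch <= '\u309F').toDouble / text.length
    }
    spark.read.parquet("documents.parquet") // assumed column: "text"
      .withColumn("metric", metric(col("text")))
      .orderBy(col("metric"))               // documents sorted by the metric
      .write.parquet("debug-hiragana-rate")
  }
}
```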
Work in progress
While most of the work is done, some things are still in progress.
One is filtering the corpus with an n-gram language model.
We plan to add two filters: one for per-document perplexity and one for per-paragraph perplexity.
Currently, Uzushio is focused on Japanese and does not provide a rich set of tools for other languages.
We hope to improve the situation in the future.
Conclusion
We hope that Uzushio will be useful for the Japanese and the wider NLP community.