
Uzushio: Text Extraction

Posted at 2023-12-25

Series

This is an article in a series about Uzushio: Japanese Corpus Preprocessing Tool.

  1. Uzushio: a preprocessing tool for LLM pretraining corpora
  2. Uzushio: Text Extraction
  3. Uzushio: Filters

The Japanese version of this article will be posted later.


The first stage of Uzushio is text extraction.
More precisely, this stage produces paragraph-separated text documents from HTML documents.
Our current implementation uses WARC files as input.
In the future it would be better to support other input formats, e.g. PDF files, but for now we limit ourselves to HTML documents.

As with the other steps, text extraction is implemented as an Apache Spark job.

In general, text extraction performs the following steps:

  • WARC format parsing
  • Character encoding detection
  • Language detection
  • HTML parsing

In the end, we produce documents in Zstandard-compressed Parquet format.

WARC Format Parsing

WARC is a gzipped, text-based, HTTP-like format.
It encapsulates other content, similar to an HTTP body, and carries meta-information, similar to HTTP headers.
The WARC files of the Common Crawl project contain crawler requests and responses.
Common Crawl also contains truncated and corrupted archives.

Uzushio uses the webarchive-commons library to parse WARC files.
We ignore all non-response records and try to narrow the parsed WARC messages down to HTTP responses with text-like content.

After that, we parse the HTTP messages and extract their headers, which are used in further steps.
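
As an illustration, enumerating and filtering the records of one WARC file could look roughly like the following. This is a minimal sketch, not Uzushio's actual code; the helper name is ours.

```scala
import java.io.File
import org.archive.io.ArchiveRecord
import org.archive.io.warc.WARCReaderFactory
import scala.jdk.CollectionConverters._

// Enumerate the records of one WARC file and keep only WARC "response" records.
// The HTTP message inside each record is parsed separately afterwards to check
// for a text-like Content-Type. (Closing the reader is omitted for brevity.)
def responseRecords(warcFile: File): Iterator[ArchiveRecord] = {
  val reader = WARCReaderFactory.get(warcFile)
  reader.iterator().asScala.filter { record =>
    "response" == record.getHeader.getHeaderValue("WARC-Type")
  }
}
```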

Character Encoding Detection

We use the following algorithm to detect the character encoding of the HTTP body.
We try encodings from several sources, in the order specified below.
For each candidate encoding, we decode roughly the first 16 KB of the body and look for unmappable characters.
If an unmappable character exists, we try the next encoding in the list.
If all candidates fail, we skip the current document.
(A sketch of this probing loop follows the list below.)

  1. Sniff the <meta http-equiv="content-type" ...> tag from the body of the document.
    If it contains the encoding in the content=... attribute, we use it as a candidate.
  2. Use the value of the Content-Type HTTP header.
  3. Try the juniversalchardet library to guess the encoding of the body. It is a Java version of the library Mozilla uses in the Firefox browser.
  4. Try UTF-8.
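
Here is that sketch, using the standard java.nio.charset API; the helper name and structure are illustrative, not the actual implementation.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{Charset, CharacterCodingException, CodingErrorAction}

// Hypothetical helper: return the first candidate charset that decodes the
// first ~16 KB of the body without unmappable or malformed sequences.
// Candidates come from the sources listed above, in that order.
def pickCharset(body: Array[Byte], candidates: Seq[String]): Option[Charset] = {
  val probe = ByteBuffer.wrap(body, 0, math.min(body.length, 16 * 1024))
  candidates.iterator
    .flatMap(name => try Some(Charset.forName(name)) catch { case _: Exception => None })
    .find { cs =>
      val decoder = cs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
      try { decoder.decode(probe.duplicate()); true }
      catch { case _: CharacterCodingException => false }
    }
}
```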

Language Detection

We use the Optimaize language-detector library to detect languages.
It uses character n-gram signatures for language detection.
However, we build the buffer for language detection in a special way.
We use at most 20 kilobytes of the document body to decode at most 4k characters, skipping all characters with a code point lower than 127.
The exception to this rule is that any sequence of whitespace characters is kept as a single space.
The main idea is to skip all markup while forming the buffer for language detection.
Our initial version focuses on non-romanic languages, so skipping ASCII text during language detection is acceptable.
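
A sketch of how such a buffer can be built from already-decoded text; the function below is our illustration, and the exact point where the 20 KB byte limit is applied is an assumption.

```scala
// Build the language-detection buffer: at most ~4096 output characters,
// ASCII characters (code < 127) dropped, except that any whitespace run
// is replaced by a single space. The ~20 KB limit mentioned above is
// assumed to be applied to the raw bytes before decoding.
def languageDetectionBuffer(decoded: String, maxChars: Int = 4096): String = {
  val sb = new StringBuilder(maxChars)
  var lastWasSpace = false
  var i = 0
  while (i < decoded.length && sb.length < maxChars) {
    val c = decoded.charAt(i)
    if (Character.isWhitespace(c)) {
      // Keep a single space in place of any whitespace sequence.
      if (!lastWasSpace) { sb.append(' '); lastWasSpace = true }
    } else if (c >= 127) {
      sb.append(c)
      lastWasSpace = false
    }
    // Other ASCII characters (mostly leftover markup) are skipped entirely.
    i += 1
  }
  sb.toString
}
```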

HTML Parsing

We parse the content of HTML documents using Apache Tika, which is widely used to extract text content from various documents.
We use its SAX-based parsing API to handle text content.

The basic unit is a paragraph

The parsing step outputs a list of document paragraphs.
We define paragraphs based on the structure of the HTML document, using tags which are usually block-related, e.g. <p> or <div>.
<br> tags are replaced with \n characters.
Most of the remaining tags are ignored.

After parsing, we replace every run of consecutive whitespace with a single space character and delete all empty or whitespace-only lines.
We also ignore paragraphs which are empty or consist only of whitespace characters.
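
To make this concrete, a simplified paragraph collector on top of Tika's SAX API could look like the sketch below. The set of block tags and the normalization details are an approximation for illustration, not the actual Uzushio handler, which also records CSS selectors and link markers (see below).

```scala
import java.io.ByteArrayInputStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.html.{HtmlMapper, HtmlParser, IdentityHtmlMapper}
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ArrayBuffer

// Collect paragraphs from Tika's SAX events: flush on block-like tags,
// turn <br> into \n, collapse whitespace, drop whitespace-only paragraphs.
class ParagraphHandler extends DefaultHandler {
  private val blockTags = Set("p", "div", "li", "td", "blockquote", "h1", "h2", "h3", "h4")
  private val current = new StringBuilder
  val paragraphs = ArrayBuffer.empty[String]

  private def flush(): Unit = {
    val text = current.toString.replaceAll("\\s+", " ").trim
    if (text.nonEmpty) paragraphs += text
    current.clear()
  }

  override def startElement(uri: String, local: String, qName: String, atts: Attributes): Unit =
    if (local == "br") current.append('\n') else if (blockTags(local)) flush()

  override def endElement(uri: String, local: String, qName: String): Unit =
    if (blockTags(local)) flush()

  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    current.appendAll(ch, start, length)
}

// Usage sketch: let Tika deliver events for all tags (not only its "safe" subset)
// by installing an IdentityHtmlMapper into the ParseContext.
// val handler = new ParagraphHandler
// val context = new ParseContext
// context.set(classOf[HtmlMapper], IdentityHtmlMapper.INSTANCE)
// new HtmlParser().parse(new ByteArrayInputStream(htmlBytes), handler, new Metadata, context)
```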

CSS selectors

For each paragraph we also record its CSS selector.
For every ancestor tag of the current paragraph-like tag, we record:

  • Tag name,
  • Tag id, if it exists,
  • Tag classes if they exist.

We use the usual CSS selector syntax to record this information. More precisely:

  • Tags are separated with >; the current tag is the rightmost one, and its parents go to the left.
  • Classes, if present, are separated with . both from the tag name and from other classes. There can be multiple classes.
  • Tag ids are separated from tag names or classes with #.
  • Elements always come in tag.class1.class2#id order; classes and ids are optional.

For example, a selector for a tag can look like body>div.content#main>div.something.left>div#blog.
There are four tags: one body and three divs.
The first div tag has the single class content and the id main,
the second div tag has two classes, something and left,
and the third div tag has the id blog.
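
A hypothetical sketch of rendering such selectors from a list of ancestor tags; the TagInfo type is ours, for illustration only.

```scala
// Each ancestor tag becomes a tag.class1.class2#id component; components are joined with ">".
final case class TagInfo(name: String, classes: Seq[String], id: Option[String])

def selectorPart(tag: TagInfo): String =
  tag.name + tag.classes.map("." + _).mkString + tag.id.map("#" + _).getOrElse("")

def cssSelector(path: Seq[TagInfo]): String =
  path.map(selectorPart).mkString(">")

// cssSelector(Seq(
//   TagInfo("body", Nil, None),
//   TagInfo("div", Seq("content"), Some("main")),
//   TagInfo("div", Seq("something", "left"), None),
//   TagInfo("div", Nil, Some("blog"))
// ))  ==  "body>div.content#main>div.something.left>div#blog"
```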

Links

HTML documents contain a lot of links, and pages which contain many of them are often not useful for LLM training.
Because of this, we mark text which is link content.
Namely, we surround the text content of <a> tags with the 0x02 and 0x03 characters in the extracted text.
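
A minimal illustration of this convention; the helper below is hypothetical.

```scala
// Anchor text is wrapped in the STX (0x02) and ETX (0x03) control characters
// in the extracted paragraph text.
def markLink(anchorText: String): String = "\u0002" + anchorText + "\u0003"

// e.g. a paragraph "Read \u0002the next article\u0003 for details." means that
// "the next article" was the content of an <a> tag in the source HTML.
```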

Resulting Format

We store the extraction results as a Parquet file with multiple columns:

  • text stores the document content; the format is described below.
  • docId is the response ID from the WARC file, or a random UUID if that header did not exist in the WARC record.
  • url stores the URL of the document as given in the WARC file.
  • charset is the original charset of the document. The text itself always uses a Unicode-compatible encoding, as specified by the Parquet format.
  • lang is the detected language of the document.
  • date is the minimum of the crawl timestamp, the Date HTTP header, and the Last-Modified HTTP header. Dates before 1999-01-01 are ignored.

We use Zstandard compression for the Parquet files.

The text field of a document contains all paragraphs concatenated and separated by \n\n: two EOL characters.
Paragraph texts themselves are normalized so that they never contain more than one consecutive whitespace character.

Within each paragraph, the CSS selector of the paragraph tag is separated from the actual paragraph text by the 0x1c character: the ASCII field separator.
The paragraph text contains link content wrapped in the 0x02 and 0x03 characters, as described in the previous section.
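
For example, a consumer of this format could split documents back into paragraphs along these separators; this is a sketch, not part of Uzushio.

```scala
// Split the `text` column back into (cssSelector, paragraphText) pairs
// and strip the 0x02/0x03 link markers.
final case class Paragraph(cssPath: String, text: String)

def parseDocument(text: String): Seq[Paragraph] =
  text.split("\n\n").toSeq.map { par =>
    val sep = par.indexOf('\u001c') // ASCII field separator between selector and text
    val (path, body) =
      if (sep < 0) ("", par) else (par.substring(0, sep), par.substring(sep + 1))
    Paragraph(path, body.replace("\u0002", "").replace("\u0003", ""))
  }

// The Parquet output itself can be read back with the standard Spark API,
// e.g. spark.read.parquet(...), and parseDocument applied to the `text` column.
```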

Extraction Statistics

Extracted Documents

The graph above shows the number of extracted Japanese documents from each Common Crawl dump.
While the raw dumps are on the order of 50 to 100 TB each, there are very few Japanese documents before the 2016-44 dump.
On average, there are 15.2M Japanese documents per dump.

Ratios of plain text from HTML documents

With regard to data size, extracting plain Japanese text from Common Crawl WARC files yields ~0.1% of the original size.
For this computation both input and output data are compressed: WARC data uses gzip compression and extracted text uses Zstd compression.
The second graph shows ratios of extracted text to the original dump size.
The picture is very similar to the one for the number of extracted documents.

Encodings of the Japanese Common Crawl

The final graph shows the distribution of character encodings in the Japanese Common Crawl.
The dominant encoding is UTF-8, and its share is growing over time.

Still, about 10% of documents use legacy encodings; Shift-JIS (or its variants windows-31j and CP932) is the most popular one.
For dumps that have a near-zero number of documents in Shift-JIS, this graph counts Shift-JIS as windows-31j.

As an interesting side note, there were several documents in the wild encoded in UTF-32LE, which we find very surprising, as this encoding is poorly suited for transmitting character data.
