Series
This is an article in a series about Uzushio: Japanese Corpus Preprocessing Tool.
A Japanese version of this article will be posted later.
Uzushio: Text Extraction
The first stage of Uzushio is text extraction.
More precisely, this stage produces paragraph-separated text documents from HTML documents.
Our current implementation uses WARC files as input.
In the future it would be good to support other input formats, e.g. PDF files, but for now we limit ourselves to HTML documents.
As with other steps, text extraction is implemented as an Apache Spark job.
In general, text extraction performs the following steps:
- WARC format parsing
- Character encoding detection
- Language detection
- HTML parsing
In the end we produce documents in zstd-compressed parquet format.
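To make the shape of the job concrete, the overall skeleton could look roughly like the sketch below. This is not Uzushio's actual entry point: the object, class and method names are placeholders, and `extractDocuments` stands in for the per-record steps described in the following sections.

```scala
import org.apache.spark.sql.SparkSession

object TextExtractionJob {
  // Output row; a subset of the columns described in the "Resulting Format" section
  case class ExtractedDoc(docId: String, url: String, charset: String, lang: String, text: String)

  // Placeholder for the WARC parsing, charset/language detection and HTML parsing steps
  def extractDocuments(path: String, content: Array[Byte]): Iterator[ExtractedDoc] =
    Iterator.empty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("uzushio-text-extraction").getOrCreate()
    import spark.implicits._

    spark.sparkContext
      .binaryFiles(args(0))                            // raw *.warc.gz files
      .flatMap { case (path, data) => extractDocuments(path, data.toArray()) }
      .toDF()
      .write
      .option("compression", "zstd")                   // zstd-compressed parquet output
      .parquet(args(1))

    spark.stop()
  }
}
```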
WARC Format Parsing
WARC is a gzip-compressed, text-based format that closely resembles HTTP messages.
It encapsulates other content, similarly to an HTTP body, and carries meta-information, similarly to HTTP headers.
WARC files of the Common Crawl project contain crawler requests and responses.
Common Crawl also contains truncated and corrupted archives.
Uzushio uses the webarchive-commons library to parse WARC files.
We ignore all non-response records and filter the parsed WARC messages down to HTTP responses with text-like content.
After that we parse the HTTP messages and extract their headers, which are used in further steps.
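As an illustration, iterating over WARC records with webarchive-commons could look like the sketch below. The helper name is ours, the HTTP parsing of the record body is omitted, and error handling for truncated or corrupted records is left out.

```scala
import java.io.File
import scala.jdk.CollectionConverters._
import org.archive.io.warc.WARCReaderFactory

// Iterate WARC records and keep only crawler responses; requests, metadata
// and other record types are skipped.
def responseRecords(warcFile: File): Unit = {
  val reader = WARCReaderFactory.get(warcFile)
  try {
    for (record <- reader.iterator().asScala) {
      val header = record.getHeader
      if ("response" == String.valueOf(header.getHeaderValue("WARC-Type"))) {
        val url = header.getUrl
        // the record itself is an InputStream over the captured HTTP response
        var size = 0L
        val buf = new Array[Byte](8192)
        var read = record.read(buf)
        while (read >= 0) { size += read; read = record.read(buf) }
        println(s"$url -> $size bytes")
      }
    }
  } finally reader.close()
}
```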
Character Encoding Detection
We use the following algorithm to detect the character encoding of the HTTP body.
We try candidate encodings from different sources, as specified below.
With each candidate encoding, we try to decode ~16k bytes of the body and look for unmappable characters.
If an unmappable character is found, we try the next encoding in the list.
If everything fails, we skip the current document.
- Sniff the `<meta http-equiv="content-type" ...>` tag from the body of the document. If it contains the encoding in the `content=...` attribute, we use it as a candidate.
- Use the value of the `Content-Type` HTTP header.
- Try the juniversalchardet library to guess the encoding of the body. It is a Java version of the library Mozilla uses in the Firefox browser.
- Try UTF-8.
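The candidate-checking loop can be sketched with plain JDK charset APIs, as below. The function name is illustrative, and the assembly of the candidate list from the `<meta>` tag, the `Content-Type` header and juniversalchardet is not shown here.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{Charset, CodingErrorAction}
import scala.util.Try

// Minimal sketch: return the first candidate encoding that decodes the probe cleanly.
def detectCharset(body: Array[Byte], candidates: Seq[String]): Option[Charset] = {
  // only look at the first ~16k bytes, as described above
  val probe = ByteBuffer.wrap(body, 0, math.min(body.length, 16 * 1024))
  candidates.iterator
    .flatMap(name => Try(Charset.forName(name)).toOption) // skip unknown charset names
    .find { cs =>
      val decoder = cs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)        // fail on bad byte sequences
        .onUnmappableCharacter(CodingErrorAction.REPORT)
      // note: a real implementation should tolerate a multi-byte sequence
      // that is cut off at the end of the probe
      Try(decoder.decode(probe.duplicate())).isSuccess
    }
}
```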
Language Detection
We use the Optimaize language-detector library to detect languages.
It uses character n-gram signatures for language detection.
However, we form the buffer for language detection in a special way.
We use at most 20 kilobytes of the document body to decode at most 4k characters, while skipping all characters with a code lower than 127.
The exception to this rule is that any sequence of whitespace characters is kept as a single space.
The main idea is to skip all markup while forming the buffer for language detection.
Our initial version focuses on languages which do not use the Roman script, so we can afford to skip Roman-script characters during language detection.
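A sketch of how such a buffer could be formed is shown below; the function name and the exact handling of the limits are illustrative, not Uzushio's actual code.

```scala
import java.nio.charset.Charset

// Build the language-detection buffer as described above: read at most 20 KB of the
// decoded body, keep at most 4096 characters, drop characters below code point 127
// (mostly markup), but collapse any whitespace run into a single space.
def languageDetectionBuffer(body: Array[Byte], charset: Charset): String = {
  val decoded = new String(body, 0, math.min(body.length, 20 * 1024), charset)
  val sb = new StringBuilder
  var lastWasSpace = false
  var i = 0
  while (i < decoded.length && sb.length < 4096) {
    val c = decoded.charAt(i)
    if (Character.isWhitespace(c)) {
      if (!lastWasSpace) { sb.append(' '); lastWasSpace = true }
    } else if (c >= 127) {
      sb.append(c)
      lastWasSpace = false
    }
    i += 1
  }
  sb.toString
}
```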
HTML Parsing
We parse the content of HTML documents using Apache Tika, which is widely used to extract text content from various documents.
We use its SAX-based parsing API to handle text content.
The basic unit is a paragraph
The parsing step outputs a list of document paragraphs.
We define paragraphs based on the structure of the HTML document, using tags which are usually block-level, e.g. `<p>` or `<div>`.
`<br>` tags are replaced with `\n` characters.
Most of the remaining tags are ignored.
After parsing, we replace every run of consecutive whitespace with a single space character and delete all empty or whitespace-only lines.
Paragraphs which are empty or consist only of whitespace characters are ignored as well.
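A much simplified version of such a SAX handler might look as follows. The class and the set of block tags are illustrative rather than Uzushio's actual handler, and the real implementation additionally tracks the CSS selectors and link markers described below.

```scala
import java.io.ByteArrayInputStream
import scala.collection.mutable.ArrayBuffer
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.html.{HtmlMapper, HtmlParser, IdentityHtmlMapper}
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Block-level tags start a new paragraph, <br> becomes a newline,
// everything else only contributes its text content.
class ParagraphHandler extends DefaultHandler {
  private val blockTags = Set("p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6")
  private val current = new StringBuilder
  val paragraphs = ArrayBuffer.empty[String]

  private def flush(): Unit = {
    // collapse whitespace and drop empty paragraphs, as described above
    val text = current.toString.replaceAll("\\s+", " ").trim
    if (text.nonEmpty) paragraphs += text
    current.clear()
  }

  override def startElement(uri: String, local: String, qName: String, atts: Attributes): Unit =
    if (blockTags.contains(local)) flush()
    else if (local == "br") current.append('\n')

  override def endElement(uri: String, local: String, qName: String): Unit =
    if (blockTags.contains(local)) flush()

  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    current.appendAll(ch, start, length)

  override def endDocument(): Unit = flush()
}

def extractParagraphs(html: Array[Byte]): Seq[String] = {
  val handler = new ParagraphHandler
  val context = new ParseContext()
  // let all tags through so the block structure (and ancestors for CSS selectors)
  // is visible; Tika's default mapper only keeps a subset of elements
  context.set(classOf[HtmlMapper], IdentityHtmlMapper.INSTANCE)
  new HtmlParser().parse(new ByteArrayInputStream(html), handler, new Metadata(), context)
  handler.paragraphs.toSeq
}
```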
CSS selectors
For each paragraph we also record its CSS selector.
We walk from the current paragraph-like tag through all of its parent tags, and for each tag record
- the tag name,
- the tag id, if it exists,
- the tag classes, if they exist.
We use the usual CSS selector syntax for recording this information. To be more precise:
- Tags are separated with `>`; the current tag is the rightmost one, and its parents go to the left.
- Classes, if present, are separated with `.`, both from the tag name and from other classes. There can be multiple classes.
- Tag ids are separated from tag names or classes with `#`.
- Elements always come in `tag.class1.class2#id` order; classes and ids are optional.
For example, a selector for a tag can look like `body>div.content#main>div.something.left>div#blog`.
There are four tags here: one `body` and three `div`s.
The first `div` tag has a single class `content` and the id `main`,
the second `div` tag has two classes, `something` and `left`,
and the third `div` tag has the id `blog`.
Links
HTML documents contain a lot of links, and pages which contain a lot of them are often not useful
for LLM training.
Because of this, we mark text which is link content.
Namely, we surround the text content of `<a>` tags with `0x02` and `0x03` characters in the extracted text.
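As an illustration of why these markers are useful, a later stage could, for example, measure how much of a paragraph is link text. The function below is a hypothetical downstream helper, not part of the extraction stage itself.

```scala
// Given a paragraph's text (with the CSS selector prefix already stripped),
// compute the fraction of characters that sit inside 0x02 ... 0x03 link markers.
def linkCharRatio(paragraph: String): Double = {
  var inLink = false
  var linkChars = 0
  var textChars = 0
  paragraph.foreach {
    case '\u0002' => inLink = true
    case '\u0003' => inLink = false
    case _ =>
      textChars += 1
      if (inLink) linkChars += 1
  }
  if (textChars == 0) 0.0 else linkChars.toDouble / textChars
}
```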
Resulting Format
We store the extraction results as a parquet file with the following columns:
- `text` stores the document content; the format is described below.
- `docId` is the response ID from the WARC file, or a random UUID if the header did not exist in the WARC record.
- `url` stores the URL of the document as given in the WARC file.
- `charset` is the original charset of the document. The text itself always uses a Unicode-compatible encoding, as specified by the parquet format.
- `lang` is the detected language of the document.
- `date` is the minimum of the crawl timestamp, the `Date` HTTP header and the `Last-Modified` HTTP header. Dates before 1999-01-01 are ignored.
We use Zstandard compression for parquet files.
The `text` field of a document contains all paragraphs concatenated, separated by `\n\n`: two EOL characters.
Text paragraphs themselves are normalized so that they never contain more than a single consecutive whitespace character.
Within each paragraph, the CSS selector for the paragraph tag is separated from the actual paragraph text with the `0x1c` character, the ASCII field separator.
Paragraph text contains link content wrapped within `0x02` and `0x03` characters, as described in the previous section.
Extraction Statistics
The graph above shows the number of extracted Japanese documents from each Common Crawl dump.
While the raw dumps are on the order of 50 to 100 TB, there are very few Japanese documents before the 2016-44 dump.
On average, there are 15.2M Japanese documents per dump.
With regard to data size, extracting plain Japanese text from Common Crawl WARC files yields ~0.1% of the original size.
For this computation both input and output data are compressed: WARC data uses gzip compression and extracted text uses Zstd compression.
The second graph shows ratios of extracted text to the original dump size.
The picture is very similar to the number of extracted documents.
Encodings of Japanese Common Crawl
The final graph shows the distribution of character encodings in the Japanese part of Common Crawl.
The dominant encoding is UTF-8, and its share is growing over time.
Still, ~10% of documents use legacy encodings; Shift-JIS (or its variant windows-31j, also known as CP932) is the most popular one.
This graph counts Shift-JIS as windows-31j for dumps which have a near-zero number of documents in Shift-JIS.
As an interesting side note, there were several documents in the wild encoded in UTF-32LE, which we find very surprising, as this encoding is very poorly suited for transmitting character data.