Don't be Afraid of Gigabytes of Text

Requirements

I receive a huge fixed-length text file from an IBM host server: for example, 20 million contract records, 1,000 bytes per record, 18.6 GB in total. The file must be split into 50,000 CSV files according to the organization in charge. In this article, I would like to introduce the kinds of bugs and monsters you will encounter in such an out-of-the-ordinary job. *The programs below are written as efw js events.
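
To get a feel for the numbers, here is a quick back-of-the-envelope check (plain arithmetic, not part of the job itself):

var records = 20000000;  //20 million contract records
var recordSize = 1000;   //bytes per fixed-length record
var totalGiB = records * recordSize / (1024*1024*1024); //≈18.6 GiB
var recordsPerFile = records / 50000;                   //≈400 records per CSV file on average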

Sample Data

Since it is difficult to create test data of that scale by hand, the sample uses the character code MS932 instead of IBM Cp930 or Cp939, a record size of 20 bytes instead of 1,000 bytes, and 100 records, so that it can be handled in a Windows environment. Let's explore different approaches using this sample file.
[Screenshot: the sample file opened in a text editor]
The sample file is small enough to open in an editor: in each record, the first 10 bytes are the ID and the next 10 bytes are the name.
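
If you want to reproduce a similar sample file, a sketch like the following should work (my own sketch, assuming the efw js event can use Nashorn's Java interop; the ID and name values are made up):

//generate 100 fixed-length records of 20 bytes each (all ASCII, so 1 byte per character in MS932)
var fos = new java.io.FileOutputStream("text&csv/myText.txt");
for (var i = 1; i <= 100; i++) {
    var id = ("0000000000" + i).slice(-10);                  //10-byte zero-padded ID
    var name = ("NAME" + i + "          ").substring(0, 10); //10-byte space-padded name
    fos.write(new java.lang.String(id + name).getBytes("MS932"));
}
fos.close();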

Example 1: An innocent example without considering various restrictions

var ary=new BinaryReader(
    "text&csv/myText.txt",//file to read
    [10,10],//bytes per field
    ["MS932","MS932"],//character encoding per field
    20//bytes per record layout
).readAllLines();//read all records at once
for(var i=0;i<ary.length;i++){
    //determine the output file by ID
    var writer= new CSVWriter("text&csv/seperated/"+ary[i][0]+".csv", ",", "\"", "MS932");
    writer.writeLine(ary[i]);//write the record
    writer.close();
}

The problem with this example is readAllLines. Because the entire data file is loaded into memory at once, there is a risk of memory overflow once the file size reaches gigabytes. Granted, 20 million rows still doesn't exceed Java's array limit, and a great server could (maybe) hold it all in memory, but let's be craftsmen and improve! *Can programmers be called craftsmen?
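
As a rough sanity check on the memory risk (my own estimate, assuming mostly single-byte characters in the source data):

var rawBytes = 20000000 * 1000; //20,000,000,000 bytes of raw record data
var asStrings = rawBytes * 2;   //roughly doubles once decoded into UTF-16 Java strings
//add array and per-object overhead for 20 million rows x N fields,
//and the required heap comfortably exceeds 40 GB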

Example 2: A cautious example of processing one record at a time

new BinaryReader(
    "text&csv/myText.txt",//file to read
    [10,10],//bytes per field
    ["MS932","MS932"],//character encoding per field
    20//bytes per record layout
).loopAllLines(function(fields,index){//read all records one at a time
    //determine the output file by ID
    var writer= new CSVWriter("text&csv/seperated/"+fields[0]+".csv", ",", "\"", "MS932");
    writer.writeLine(fields);//write the record
    writer.close();
});

Read one line at a time with loopAllLines and write each line to a file. Memory pressure can no longer occur. However, since every read is immediately followed by an open-write-close cycle, hard disk IO is likely to become the bottleneck.
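
If you want to confirm where the time goes, a crude measurement around the loop is enough (a sketch; efw events run on the JVM, so Java interop should be available):

var start = java.lang.System.currentTimeMillis();
//...run the read/write loop above...
var elapsed = java.lang.System.currentTimeMillis() - start;
//compare runs with the CSVWriter calls commented out
//to see how much of the time is spent on file IO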

Example 3: Example of dividing IO by lot

The word "lot" comes from manufacturing, where it means producing a large quantity of goods of the same quality in one batch; I borrow it in that sense. The lot size is set to 10 in the source below, but in the actual program it was 200,000 lines, about 200 MB per lot.

var buffer=[];//buffer for lot processing
new BinaryReader(
    "text&csv/myText.txt",//file to read
    [10,10],//bytes per field
    ["MS932","MS932"],//character encoding per field
    20//bytes per record layout
).loopAllLines(function(fields,index){//read all records one at a time
    buffer.push(fields);
    if (index % 10 == 0){//check whether the lot size has been reached
        saveBuffer();//save the lot
    }
});
saveBuffer();//save the remaining records that did not fill a lot
//------internal function to save the buffer
function saveBuffer(){
    for (var i=0;i<buffer.length;i++){
        //determine the output file by ID
        var writer= new CSVWriter("text&csv/seperated/"+buffer[i][0]+".csv", ",", "\"", "MS932");
        writer.writeLine(buffer[i]);//write the record
        writer.close();
    }
    buffer=[];//reset the buffer
}

Reading and writing are now separated. However, each open-and-close of a file still writes only a single line of data. What a waste.

Example 4: Example of reusing writers

Based on Example 3, let's reuse the writers instead of reopening one for every line.

var buffer=[];//buffer for lot processing
var writers={};//map that stores the writers
new BinaryReader(
    "text&csv/myText.txt",//file to read
    [10,10],//bytes per field
    ["MS932","MS932"],//character encoding per field
    20//bytes per record layout
).loopAllLines(function(fields,index){//read all records one at a time
    buffer.push(fields);
    if (index % 10 == 0){//check whether the lot size has been reached
        saveBuffer();//save the lot
    }
});
saveBuffer();//save the remaining records that did not fill a lot
saveWriters();//close all writers at once
//------internal function to save the buffer
function saveBuffer(){
    for (var i=0;i<buffer.length;i++){
        //determine the output file by ID
        var writer=writers[buffer[i][0]];
        if (writer==null){
            writer=new CSVWriter("text&csv/seperated/"+buffer[i][0]+".csv", ",", "\"", "MS932");
            writers[buffer[i][0]]=writer;
        }
        writer.writeLine(buffer[i]);//write the record
    }
    buffer=[];//reset the buffer
}
//--------function to close all writers at once
function saveWriters(){
    for(var key in writers){
        if (key=="debug")continue;
        writers[key].close();
    }
}

One writer can now be used to write many lines of data, which increases efficiency. However, when I did the math, there are 50,000 organizations, so by the end of the process 50,000 CSV files would be open at the same time, which seems dangerous. I haven't found a documented limit on the number of open files, but my tests showed that several thousand files can be open simultaneously.
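
One possible middle ground before giving up on shared writers (my own sketch, not from the original project): cap how many writers are kept open, and close them all whenever the cap is exceeded. This keeps Example 4's reuse for frequently seen IDs while staying under the handle limit. It assumes, as Examples 3 and 5 already do, that CSVWriter appends when it opens an existing file.

var writers={};   //map that stores the writers
var openCount=0;  //number of writers currently open
var MAX_OPEN=1000;//cap, chosen well below the limit observed in testing
function getWriter(id){
    var writer=writers[id];
    if (writer==null){
        if (openCount>=MAX_OPEN){//too many open files: close everything and start over
            for(var key in writers){
                if (key=="debug")continue;
                writers[key].close();
            }
            writers={};
            openCount=0;
        }
        writer=new CSVWriter("text&csv/seperated/"+id+".csv", ",", "\"", "MS932");
        writers[id]=writer;
        openCount++;
    }
    return writer;//the caller uses writer.writeLine(...) as in Example 4
}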

Example 5: Example of dividing the buffer array by ID

We cannot write without opening a writer at some point, so instead split the buffer array by ID; that way each file is opened once per lot and its lines are written in bulk.

var buffer={};//buffer map for lot processing; holds an array of records per ID
new BinaryReader(
    "text&csv/myText.txt",//file to read
    [10,10],//bytes per field
    ["MS932","MS932"],//character encoding per field
    20//bytes per record layout
).loopAllLines(function(fields,index){//read all records one at a time
    //initialize the per-ID array if it does not exist yet
    if (buffer[fields[0]]==null)buffer[fields[0]]=[];
    buffer[fields[0]].push(fields);
    if (index % 10 == 0){//check whether the lot size has been reached
        saveBuffer();//save the lot
    }
});
saveBuffer();//save the remaining records that did not fill a lot
//------internal function to save the buffer
function saveBuffer(){
    for (var key in buffer){
        if (key=="debug")continue;
        var ary=buffer[key];
        var writer=new CSVWriter("text&csv/seperated/"+key+".csv", ",", "\"", "MS932");
        for(var i=0;i<ary.length;i++){
             writer.writeLine(ary[i]);//write the record
        }
        writer.close();
    }
    buffer={};//reset the buffer
}

This solves the problem of too many files being open at the same time. There are no particular problems left; this version would be fine as-is. However, in the spirit of craftsmanship, let's push the improvements further (or make things worse?).

Example 6: A multi-threaded example

Let's try reading in multiple threads.

var buffer={};//buffer map for lot processing; holds an array of records per ID
var hasDataFlag=false;//flag indicating whether any data was read
var lot=0;
do{
    hasDataFlag=false;//reset to false for each lot
    var threads = new Threads(2);
    threads.add({from:0+lot*10 ,run:makeCsvBuffer});
    threads.add({from:5+lot*10 ,run:makeCsvBuffer});
    threads.run();//run the threads
    saveBuffer();//save the buffer; sets hasDataFlag to true if there was data
    lot++;
}while(hasDataFlag);
//------function that builds the CSV buffer
function makeCsvBuffer(){
    new BinaryReader(
        "text&csv/myText.txt",//file to read
        [10,10],//bytes per field
        ["MS932","MS932"],//character encoding per field
        20,//bytes per record layout
        this.from,//record number to start reading from
        5//number of records to read = lot size / thread count
    ).loopAllLines(function(fields,index){//read all records one at a time
        //initialize the per-ID array if it does not exist yet
        helloTextCSVThread_submit.mylocker.lock();//acquire the lock
            if (buffer[fields[0]]==null)buffer[fields[0]]=[];
            buffer[fields[0]].push(fields);
        helloTextCSVThread_submit.mylocker.unlock();//release the lock
    });
}
//------internal function to save the buffer
function saveBuffer(){
    for (var key in buffer){
        if (key=="debug")continue;
        var ary=buffer[key];
        var writer=new CSVWriter("text&csv/seperated/"+key+".csv", ",", "\"", "MS932");
        for(var i=0;i<ary.length;i++){
             writer.writeLine(ary[i]);//write the record
        }
        writer.close();
        hasDataFlag=true;
    }
    buffer={};//reset the buffer
}

Since multiple threads manipulate the buffer variable, we use a locker to synchronize access. Otherwise, if thread A is in the middle of adding an array for a key while thread B decides the same key does not exist and adds its own array, one thread's record silently disappears. This effect can be observed even with just two threads.
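
To see why the lock is needed, trace the check-then-act race without it (a hypothetical interleaving of the two threads):

//without the lock, this interleaving loses a record:
//  thread A: sees buffer["X"]==null and prepares to create a new array
//  thread B: sees buffer["X"]==null, assigns [] and pushes its record
//  thread A: assigns a NEW [] to buffer["X"], discarding B's array
//  thread A: pushes its record; B's record is gone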

When tested on an actual project, the time saved was only about 10% compared with Example 5. My speculated reasons are as follows.
・The reading itself is already fast.
・The benefit of multithreading is eaten up by the synchronization required when manipulating the buffer (one way to reduce it is sketched below).
・The program at the time also had a DB import function, and since that accounts for a large share of the run time, multi-threaded reading was less effective than expected.

It's better than nothing, but considering maintainability the payoff is poor.
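
If the synchronization overhead really is the main cost, one idea (again my own sketch, not tested in the original project) is to give each thread a private buffer and merge it into the shared one once per lot, so the lock is taken once per thread instead of once per record:

function makeCsvBuffer(){
    var local={};//private per-thread buffer: no locking needed while filling it
    new BinaryReader(
        "text&csv/myText.txt",//file to read
        [10,10],//bytes per field
        ["MS932","MS932"],//character encoding per field
        20,//bytes per record layout
        this.from,//record number to start reading from
        5//number of records to read
    ).loopAllLines(function(fields,index){
        if (local[fields[0]]==null)local[fields[0]]=[];
        local[fields[0]].push(fields);
    });
    //merge into the shared buffer under a single lock acquisition
    helloTextCSVThread_submit.mylocker.lock();
    for(var key in local){
        if (key=="debug")continue;
        if (buffer[key]==null)buffer[key]=[];
        buffer[key]=buffer[key].concat(local[key]);
    }
    helloTextCSVThread_submit.mylocker.unlock();
}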

This sample can be downloaded from the link below.

This is the jar file to use (as a Maven dependency):

<dependency>
    <groupId>io.github.efwgrp</groupId>
    <artifactId>efw</artifactId>
    <version>4.07.000</version>
</dependency>

For JDK 15 or higher, the Nashorn-related jars are also required.

<dependency>
    <groupId>org.openjdk.nashorn</groupId>
    <artifactId>nashorn-core</artifactId>
    <version>15.4</version>
</dependency>