Detecting text encoding with Ude.NetStandard

.NETFramework

Posted at 2017-04-20

Detect the text encoding when reading a text file.

Prerequisites

Install Ude.NetStandard into your project via NuGet.
Ude.NetStandard is a .NET port of the Mozilla Universal Charset Detector.

There is also a package called NUniversalCharDet, but it is by the same author, so I went with the newer one.

From the Package Manager Console, install it with the following command.

PM> Install-Package Ude.NetStandard

For the record, I just clicked through the NuGet GUI with the mouse.

Code

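static string detectCharset(Stream stream)
{
    var charsetDetector = new Ude.CharsetDetector();
    //charsetDetector.Reset();  // only needed when reusing a detector instance
    // Feed the stream to the detector; this advances the stream position.
    charsetDetector.Feed(stream);
    // Signal end of input so the detector settles on its final guess.
    charsetDetector.DataEnd();

    // Charset name such as "UTF-8", or null if nothing was detected.
    return charsetDetector.Charset;
}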

This code does not restore the stream position after detection advances it, so you must rewind the stream before actually reading from it.
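The rewind step might look like the following minimal sketch. `EncodingAwareReader` and `ReadAllText` are hypothetical names made up for illustration, and falling back to UTF-8 when detection fails is my own assumption, not something the library prescribes. It also assumes a seekable stream. Not every charset name the detector reports is guaranteed to map to a .NET `Encoding`, so the lookup is wrapped in a try/catch; on .NET Core and later, legacy encodings such as Shift-JIS additionally require registering the code pages encoding provider.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingAwareReader
{
    // Hypothetical helper: detect the charset, rewind, then decode the rest of the stream.
    public static string ReadAllText(Stream stream)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("This sketch needs a seekable stream.");

        long start = stream.Position;

        var detector = new Ude.CharsetDetector();
        detector.Feed(stream);   // advances the stream position
        detector.DataEnd();

        // Rewind to where we started before actually reading.
        stream.Seek(start, SeekOrigin.Begin);

        // Assumption: fall back to UTF-8 when nothing was detected
        // or when .NET does not know the reported charset name.
        Encoding encoding = Encoding.UTF8;
        if (detector.Charset != null)
        {
            try { encoding = Encoding.GetEncoding(detector.Charset); }
            catch (ArgumentException) { /* keep the UTF-8 fallback */ }
        }

        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, encoding, true, 1024, leaveOpen: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Remembering the start position instead of seeking to 0 keeps the helper usable on streams that were already partially consumed.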

If you would rather feed the detector byte arrays chunk by chunk, use the following.

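static string detectCharset(Stream stream)
{
    var buf = new byte[1024];
    var charsetDetector = new Ude.CharsetDetector();
    //charsetDetector.Reset();  // only needed when reusing a detector instance
    do {
        int readsize = stream.Read(buf, 0, buf.Length);
        if (readsize == 0) {
            // End of stream.
            break;
        }
        // Feed only the bytes actually read, not the whole buffer;
        // the tail of buf may hold stale data from a previous read.
        charsetDetector.Feed(buf, 0, readsize);
    } while (charsetDetector.IsDone() == false);  // stop early once the detector is sure
    charsetDetector.DataEnd();

    // Charset name such as "UTF-8", or null if nothing was detected.
    return charsetDetector.Charset;
}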
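Since the detector only makes a statistical guess, it can help to check how confident it is before trusting the result. The sketch below is a variation on the chunked loop that also consults the detector's `Confidence` property. `CharsetProbe`, `DetectOrNull`, and the `minConfidence` threshold of 0.5 are hypothetical choices for illustration, not library defaults.

```csharp
using System.IO;

static class CharsetProbe
{
    // Hypothetical helper: returns the detected charset name,
    // or null when nothing was detected or the guess is too weak.
    // minConfidence = 0.5f is an arbitrary assumption; tune it for your data.
    public static string DetectOrNull(Stream stream, float minConfidence = 0.5f)
    {
        var buf = new byte[1024];
        var detector = new Ude.CharsetDetector();

        int readSize;
        while ((readSize = stream.Read(buf, 0, buf.Length)) > 0)
        {
            // Feed only the bytes that were actually read.
            detector.Feed(buf, 0, readSize);
            if (detector.IsDone())
            {
                break;  // the detector has already made up its mind
            }
        }
        detector.DataEnd();

        if (detector.Charset == null || detector.Confidence < minConfidence)
        {
            return null;
        }
        return detector.Charset;
    }
}
```

Returning null leaves the fallback decision to the caller, for example a default encoding or a prompt to the user.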

Other
