
Restoring garbled MP3 tag text with the help of a string-similarity library

Posted at 2018-02-11

Background

I wrote some code that reads MP3 tags with jaudiotagger and registers them in a DB, but some of the tags came out garbled (mojibake), so I wrote a recovery routine.

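Approach

I experimented on this site and this is how I came to understand it; apologies if I've got it wrong...

String data encoded as UTF-8 (MP3 tag) → decoded as UTF-8 (jaudiotagger) → String (inside Java, held as Unicode)
- OK: read correctly
String encoded as MS932 → decoded as ISO-8859-1 (a bug?) → String
- NG: garbled (each byte of the MS932 data shows up as a separate Latin-1 character)
- Fix: turn the garbled string back into bytes with ISO-8859-1, then decode those bytes as MS932 (the code and log below label these two steps "decode" and "encode" respectively)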
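
The round trip described above can be sketched in a few lines. This is a minimal illustration rather than the code used in this article: the class and method names (`MojibakeRoundTrip`, `garble`, `restore`) are made up for the example, and it assumes the JDK provides the `windows-31j` charset, as standard JDK builds do. The sample title is written with Unicode escapes so the source stays ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the article's project.
final class MojibakeRoundTrip {
  static final Charset MS932 = Charset.forName("windows-31j");

  // Simulates the faulty read: bytes written as MS932 are decoded as ISO-8859-1.
  static String garble(String original) {
    return new String(original.getBytes(MS932), StandardCharsets.ISO_8859_1);
  }

  // Undoes it: recover the raw bytes via ISO-8859-1, then decode them as MS932.
  static String restore(String garbled) {
    return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), MS932);
  }

  public static void main(String[] args) {
    // Five hiragana characters; each takes two bytes in MS932.
    String title = "\u3055\u304b\u3042\u304c\u308a";
    String garbled = garble(title);
    System.out.println("garbled length: " + garbled.length());
    System.out.println("restored equals original: " + restore(garbled).equals(title));
  }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value from 0x00 to 0xFF to exactly one character, so none of the original MS932 bytes are lost while the text is in its garbled form.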

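I originally used Windows and, as far as I remember, tried out various MP3 tag editors, so I suspect files with an older tag version got mixed in and caused this. iTunes displays them correctly, though...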

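Implementation

The number of possible decode/encode combinations is unbounded in principle, but if we restrict ourselves to the encodings that show up around Japanese text, it is at most (a handful of charsets)^2
- This time, trying UTF-8, MS932 (Windows-31J), and ISO-8859-1 was enough
Among the strings produced by each decode/encode combination, we need to judge which one was converted most correctly
- Since this is MP3 tag parsing, I measure the similarity against the text in the MP3 file name (usually the track title) and its directory path (usually the album or artist name), and adopt the candidate with the highest score
  - These are the hints passed in as an argument
- For the similarity measure itself I used a library, as you would expect (Lucene's LevensteinDistance)
  - This time the reference text was easy to come by, which helped, but I wonder what one should do when there isn't any...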
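
To make the scoring step concrete, here is a dependency-free sketch of the idea. It assumes a similarity defined as 1 - editDistance / max(length), which is, as far as I understand, how Lucene's `LevensteinDistance.getDistance` normalizes its result. The names `HintScoringSketch`, `editDistance`, `similarity`, and `pickBest` are hypothetical and not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of hint-based scoring, not the article's actual class.
final class HintScoringSketch {
  // Classic dynamic-programming Levenshtein distance using two rolling rows.
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Similarity in [0, 1]: 1.0 for identical strings, 0.0 for maximally different ones.
  static float similarity(String a, String b) {
    int longest = Math.max(a.length(), b.length());
    return longest == 0 ? 1.0f : 1.0f - (float) editDistance(a, b) / longest;
  }

  // Returns the candidate whose best similarity to any hint is highest
  // (null if there are no candidates; the first candidate wins ties).
  static String pickBest(List<String> candidates, Collection<String> hints) {
    String best = null;
    float bestScore = -1.0f;
    for (String candidate : candidates) {
      float score = 0.0f;
      for (String hint : hints) {
        score = Math.max(score, similarity(candidate, hint));
      }
      if (score > bestScore) {
        best = candidate;
        bestScore = score;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Charset ms932 = Charset.forName("windows-31j");
    String title = "\u3055\u304b\u3042\u304c\u308a";
    String garbled = new String(title.getBytes(ms932), StandardCharsets.ISO_8859_1);
    // A made-up file name standing in for the hints.
    List<String> hints = Arrays.asList(title + ".mp3");
    String picked = pickBest(Arrays.asList(garbled, title), hints);
    System.out.println("picked the restored title: " + picked.equals(title));
  }
}
```

Because the score is a similarity rather than a raw distance, the highest value is the best candidate, which is why the selection takes the maximum.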

CharacterEncoder.java

package com.github.yktakaha4.watsonmusic.util;

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.tuple.Pair;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.StringDistance;
import org.mozilla.universalchardet.UniversalDetector;
import org.springframework.stereotype.Component;

@Component
public class CharacterEncoder {
  private final List<Charset> charsets = Arrays.asList("utf-8", "windows-31j", "iso-8859-1").stream()
      .map(Charset::forName)
      .collect(Collectors.toList());

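  // Tries every (getBytes charset, new String charset) pair and returns the
  // candidate most similar to any of the hints (file name, directory names).
  // Note: "decode" is the charset passed to getBytes and "encode" the one
  // passed to new String, matching the labels in the log output.
  public String encodeWithHints(String string, Collection<String> hints) {
    // Lucene's getDistance returns a similarity in [0, 1]; higher means closer.
    StringDistance dist = new LevensteinDistance();

    System.out.println("original: " + string);
    return charsets.stream().flatMap((decode) -> {
      return charsets.stream().map((encode) -> {
        String encoded = new String(string.getBytes(decode), encode);
        // Score each candidate by its best match against any hint.
        Float distance = hints.stream().map((hint) -> {
          return dist.getDistance(encoded, hint);
        }).max(Comparator.naturalOrder()).orElse(0.0f);
        System.out.println("decode: " + decode + ",encode: " + encode + ", encoded: " + encoded + ",distance: " + distance);

        return Pair.of(encoded, distance);
      });
    }).max((l, r) -> {
      // Keep the candidate with the highest similarity.
      return l.getRight().compareTo(r.getRight());
    }).get().getLeft();
  }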

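  // Not used by the recovery above: guesses a charset with juniversalchardet
  // and re-decodes the string's bytes with it.
  public String encode(String string) {
    // Note: getBytes() without an argument uses the platform default charset.
    byte[] source = string.getBytes();
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(source, 0, source.length);
    // The detector only finalizes its guess once dataEnd() has been called.
    universalDetector.dataEnd();

    String charsetName = universalDetector.getDetectedCharset();
    if (charsetName != null) {
      return new String(source, Charset.forName(charsetName));
    } else {
      return string;
    }
  }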

}

Results

original: ³© ªè~Toothache and Chocolate~
decode: UTF-8,encode: UTF-8, encoded: ³© ªè~Toothache and Chocolate~,distance: 0.64102566
decode: UTF-8,encode: windows-31j, encoded: ﾂつｳﾂつｩﾂつ�ﾂつｪﾂづｨ~Toothache and Chocolate~,distance: 0.525
decode: UTF-8,encode: ISO-8859-1, encoded: ÂÂ³ÂÂ©ÂÂ ÂÂªÂÃ¨~Toothache and Chocolate~,distance: 0.46666664
decode: windows-31j,encode: UTF-8, encoded: ??????????~Toothache and Chocolate~,distance: 0.64102566
decode: windows-31j,encode: windows-31j, encoded: ??????????~Toothache and Chocolate~,distance: 0.64102566
decode: windows-31j,encode: ISO-8859-1, encoded: ??????????~Toothache and Chocolate~,distance: 0.64102566
decode: ISO-8859-1,encode: UTF-8, encoded: ����������~Toothache and Chocolate~,distance: 0.64102566
decode: ISO-8859-1,encode: windows-31j, encoded: さかあがり~Toothache and Chocolate~,distance: 0.7692308
decode: ISO-8859-1,encode: ISO-8859-1, encoded: ³© ªè~Toothache and Chocolate~,distance: 0.64102566

The encode function is unrelated here, so please ignore it.
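
The rows in the log that collapse into `?` can be reproduced in isolation. `String.getBytes(Charset)` replaces any character the charset cannot encode with that charset's replacement byte (`?` here), so going through windows-31j first throws away the information needed for recovery, whereas ISO-8859-1 keeps all ten garbled characters intact. A minimal sketch, with a hypothetical class and helper name (`LossyGetBytesDemo`, `throughCharset`):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the article's project.
final class LossyGetBytesDemo {
  static final Charset MS932 = Charset.forName("windows-31j");

  // Encodes s with cs and decodes the bytes again with the same charset.
  // Characters that cs cannot represent come back as the replacement '?'.
  static String throughCharset(String s, Charset cs) {
    return new String(s.getBytes(cs), cs);
  }

  public static void main(String[] args) {
    // The garbled title: MS932 bytes of five hiragana read as ISO-8859-1.
    String garbled = new String(
        "\u3055\u304b\u3042\u304c\u308a".getBytes(MS932), StandardCharsets.ISO_8859_1);

    // Lossy: none of the ten Latin-1 characters exist in MS932.
    System.out.println(throughCharset(garbled, MS932));
    // Lossless: ISO-8859-1 can represent every one of them.
    System.out.println(throughCharset(garbled, StandardCharsets.ISO_8859_1).equals(garbled));
  }
}
```

This is why only the rows starting from ISO-8859-1 bytes have a chance of producing the original title.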

Summary

Getting to use the Stream API here felt great (a grade-schooler-level takeaway, I know).
