
Ruby: Looking into where Nokogiri goes to find the encoding on its own

Posted at 2020-05-27

Introduction

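In an earlier article, I concluded that when the encoding argument is nil, Nokogiri looks at the charset of the meta element in the HTML being parsed.
This time, to check whether that conclusion is actually correct, I traced through the official documentation.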

Tracing the official documentation

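Nokogiri official documentation
This time I will work through this official documentation. It is in English, of course.
I usually tend to avoid English-language official documentation, but I made up my mind to go and look. Even if I can't read English, I can read code. Probably.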

The Nokogiri::HTML::Document class

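When I parse with Nokogiri I usually write something like Nokogiri::HTML.parse(html), but the method that formally does the work appears to belong to the Nokogiri::HTML::Document class.
I opened the Document class entry, looked for the .parse method, and displayed its source with "view source".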

The source is below.

lib/nokogiri/html/document.rb

def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML

  options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
  # Give the options to the user
  yield options if block_given?

  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

  if string_or_io.respond_to?(:read)
    url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
    unless encoding
      # Libxml2's parser has poor support for encoding
      # detection.  First, it does not recognize the HTML5
      # style meta charset declaration.  Secondly, even if it
      # successfully detects an encoding hint, it does not
      # re-decode or re-parse the preceding part which may be
      # garbled.
      #
      # EncodingReader aims to perform advanced encoding
      # detection beyond what Libxml2 does, and to emulate
      # rewinding of a stream and make Libxml2 redo parsing
      # from the start when an encoding hint is found.
      string_or_io = EncodingReader.new(string_or_io)
      begin
        return read_io(string_or_io, url, encoding, options.to_i)
      rescue EncodingFound => e
        encoding = e.found_encoding
      end
    end
    return read_io(string_or_io, url, encoding, options.to_i)
  end

  # read_memory pukes on empty docs
  if string_or_io.nil? or string_or_io.empty?
    return encoding ? new.tap { |i| i.encoding = encoding } : new
  end

  encoding ||= EncodingReader.detect_encoding(string_or_io)

  read_memory(string_or_io, url, encoding, options.to_i)
end

Let's focus on this part first.

lib/nokogiri/html/document.rb

  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

string_or_io is the argument where I usually pass my html.
As I read it: if string_or_io has an encoding method, its encoding name is not ASCII-8BIT, and the encoding argument was not given (it is still nil), then encoding is set to string_or_io's encoding name.
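
To make that reading concrete, here is a minimal sketch that mirrors only this branch in plain Ruby. The helper name pick_encoding and the sample strings are hypothetical and are not part of Nokogiri. The sketch also compares against the Encoding::ASCII_8BIT constant rather than the name string, which is assumed to express the same intent as the quoted code.

```ruby
# Hypothetical helper that imitates only the quoted branch of Document.parse.
# It is an illustration, not Nokogiri's API.
def pick_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    # Binary (ASCII-8BIT) strings carry no usable hint, so they are skipped.
    unless string_or_io.encoding == Encoding::ASCII_8BIT
      # An explicitly passed encoding is kept; otherwise use the string's own.
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

# Sample inputs (assumptions for illustration).
utf8_html   = String.new("<html></html>", encoding: "UTF-8")
binary_html = utf8_html.b # same bytes, tagged as ASCII-8BIT

p pick_encoding(utf8_html)              # the string's own encoding name
p pick_encoding(utf8_html, "Shift_JIS") # the explicit argument wins
p pick_encoding(binary_html)            # stays nil, so detection has to happen later
```

The point of the sketch is the precedence: an explicit argument first, then the encoding the string is tagged with, and nothing at all for binary strings.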

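I see! So when the HTML is not opened in binary mode, the result depends on how it was opened (the encoding the string ends up tagged with), which is why garbled characters can appear after parsing.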
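
The difference between the two ways of opening a file can be checked with the standard library alone. The following is a rough sketch under assumed conditions: the file content (a small Shift_JIS page) and the variable names are made up for illustration, and the encoding reported for the text-mode read depends on the environment's default external encoding.

```ruby
require "tempfile"

# Assumed sample: a small page whose bytes are Shift_JIS.
# "\u65E5\u672C\u8A9E" is the word for "Japanese" written in kanji.
sjis_bytes = "<html><body>\u65E5\u672C\u8A9E</body></html>".encode("Shift_JIS")

file = Tempfile.new(["sample", ".html"])
file.binmode
file.write(sjis_bytes)
file.close

text_mode   = File.read(file.path)    # tagged with the default external encoding
binary_mode = File.binread(file.path) # tagged as ASCII-8BIT (binary)
file.unlink

puts text_mode.encoding.inspect
puts binary_mode.encoding.inspect
# Whether the text-mode string is valid depends on the environment's default.
puts text_mode.valid_encoding?
```

In the text-mode case the string claims an encoding that may not match its real bytes, and the branch quoted above would pass that claim on as the encoding. In the binary-mode case the branch is skipped and the encoding stays nil.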

So what happens when the file is opened in binary mode and the encoding argument is nil?
This time, let's focus on this line.

lib/nokogiri/html/document.rb

encoding ||= EncodingReader.detect_encoding(string_or_io)

It says that if encoding is still nil at this point, the EncodingReader.detect_encoding method is used.
So I dutifully go and look at EncodingReader.detect_encoding in the documentation.

As before, I display the source.
The source is below.

lib/nokogiri/html/document.rb

def self.detect_encoding(chunk)
  if Nokogiri.jruby? && EncodingReader.is_jruby_without_fix?
    return EncodingReader.detect_encoding_for_jruby_without_fix(chunk)
  end
  m = chunk.match(/\A(<\?xml[ \t\r\n]+[^>]*>)/) and
    return Nokogiri.XML(m[1]).encoding

  if Nokogiri.jruby?
    m = chunk.match(/(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i) and
      return m[4]
    catch(:encoding_found) {
      Nokogiri::HTML::SAX::Parser.new(JumpSAXHandler.new(:encoding_found)).parse(chunk)
      nil
    }
  else
    handler = SAXHandler.new
    parser = Nokogiri::HTML::SAX::PushParser.new(handler)
    parser << chunk rescue Nokogiri::SyntaxError
    handler.encoding
  end
end

In this case, the method's chunk argument receives string_or_io, in other words what I usually call html.

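There are many unfamiliar methods, so I can't grasp the exact meaning right away, but doesn't the second if block (the Nokogiri.jruby? branch) contain code that refers to the meta charset?
It also returns a value with return, so this part looks very suspicious.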
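
To see what that regular expression captures, it can be tried on its own in plain Ruby, without Nokogiri. The pattern below is copied from the quoted source. The constant name META_CHARSET and the two sample tags are assumptions made up for this sketch.

```ruby
# Regular expression copied from the quoted JRuby branch of detect_encoding.
META_CHARSET = /(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i

# Hypothetical sample inputs.
legacy_meta = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
quoted_meta = '<meta charset="utf-8">'

legacy_match = legacy_meta.match(META_CHARSET)
quoted_match = quoted_meta.match(META_CHARSET)

# Capture group 4 is the charset name, which is what the source returns as m[4].
puts legacy_match[4]

# The opening quote is not part of [\w-], so this pattern alone does not match
# the quoted form and the result is nil.
puts quoted_match.inspect
```

When the pattern does not match, the quoted source does not give up. In the JRuby branch it falls through to the catch(:encoding_found) block, which runs the SAX parser. The regular expression therefore looks like a shortcut rather than the only place where the meta charset is read.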

Conclusion

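I haven't yet understood the finer details of the source, but I feel I've come much closer to the answer I was looking for.
Once I understand the source in detail, I plan to write it up in a separate article.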
