[C++] Computing UTF-8 String Length at Compile Time

Posted and last updated on 2018-08-10

Overview

I wrote this to brush up on constexpr.

Even I, the author, have no idea what it is useful for.

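Implementation

Compiles with C++14 or later, since the function relies on loops and local variables inside a constexpr function and on binary literals. BOMs are not handled at all.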

utf8_strlen.hpp

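#pragma once

#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Counts the code points in a null-terminated UTF-8 string.
// Throws std::out_of_range on a malformed sequence; inside a constant
// expression, reaching the throw makes compilation fail instead.
constexpr std::size_t utf8_strlen(const char* str) {
  std::size_t count = 0;

  while (*str != '\0') {
    // The lead byte determines how many continuation bytes follow.
    const std::uint8_t lead = static_cast<std::uint8_t>(*(str++));
    std::ptrdiff_t secondary_chars =
        (lead < 0x80) ? 0 :             // 0xxxxxxx
        ((lead >> 5) == 0b110) ? 1 :    // 110xxxxx
        ((lead >> 4) == 0b1110) ? 2 :   // 1110xxxx
        ((lead >> 3) == 0b11110) ? 3 :  // 11110xxx
        throw std::out_of_range("invalid UTF-8 sequence was detected.");

    while (secondary_chars) {
      // Continuation bytes must match 10xxxxxx. Testing the byte as an
      // unsigned value keeps this correct where plain char is unsigned.
      if ((static_cast<std::uint8_t>(*(str++)) & 0xC0) != 0x80) {
        throw std::out_of_range("invalid UTF-8 sequence was detected.");
      }

      --secondary_chars;
    }

    ++count;
  }

  return count;
}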
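
The lead-byte test in the listing above can be looked at in isolation. The following is a minimal sketch, not part of the original header: `utf8_sequence_length` is a hypothetical helper name, and it returns the total byte length of a sequence (1 to 4) rather than the number of continuation bytes. Note that this bit-pattern test, like the one in the article, still accepts 0xC0, 0xC1 and 0xF5 to 0xF7 as lead bytes even though they never occur in well-formed UTF-8.

```cpp
#include <cstdint>

// Hypothetical helper (not in the article): total length in bytes of the
// sequence that starts with `lead`, or 0 if `lead` cannot start a sequence
// according to the same bit-pattern test the article uses.
constexpr int utf8_sequence_length(std::uint8_t lead) {
  return (lead < 0x80)            ? 1   // 0xxxxxxx: ASCII
       : ((lead >> 5) == 0b110)   ? 2   // 110xxxxx
       : ((lead >> 4) == 0b1110)  ? 3   // 1110xxxx
       : ((lead >> 3) == 0b11110) ? 4   // 11110xxx
                                  : 0;  // 10xxxxxx or 11111xxx
}

static_assert(utf8_sequence_length(0x41) == 1, "'A'");
static_assert(utf8_sequence_length(0xC3) == 2, "lead byte of U+00E9");
static_assert(utf8_sequence_length(0xE3) == 3, "lead byte of U+3042");
static_assert(utf8_sequence_length(0xF0) == 4, "lead byte of U+1F600");
static_assert(utf8_sequence_length(0x80) == 0, "continuation byte");
static_assert(utf8_sequence_length(0xFF) == 0, "never valid in UTF-8");
```

Each `static_assert` is checked by the compiler, so the file only builds if the classification matches the comments.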

Usage

main.cpp

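#include <cstddef>
#include <cstdio>
#include "utf8_strlen.hpp"

int main() {
  // From C++20 on, u8"" literals have type const char8_t[], so this line
  // needs -std=c++14 or -std=c++17 as written.
  constexpr char str[] = u8"UTF-8だよ〜ん";
  constexpr std::size_t length = utf8_strlen(str);

  std::printf("%zu\n", length);  // %zu is the conversion for size_t
  return 0;
}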

$ g++ -std=c++14 -o main main.cpp
$ ./main
9
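
Because the result is a constant expression, it can also be checked with `static_assert` or used as an array bound. The sketch below repeats a compact copy of the function so that it compiles on its own; the test strings, `kLen` and `buffer` are made up for illustration, and the bytes are written as escapes so the source file's encoding does not matter.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Compact copy of the article's function so this sketch is self-contained.
constexpr std::size_t utf8_strlen(const char* str) {
  std::size_t count = 0;
  while (*str != '\0') {
    const std::uint8_t lead = static_cast<std::uint8_t>(*str++);
    int trailing = (lead < 0x80) ? 0
                 : ((lead >> 5) == 0b110) ? 1
                 : ((lead >> 4) == 0b1110) ? 2
                 : ((lead >> 3) == 0b11110) ? 3
                 : throw std::out_of_range("invalid UTF-8 lead byte");
    for (; trailing > 0; --trailing) {
      if ((static_cast<std::uint8_t>(*str++) & 0xC0) != 0x80) {
        throw std::out_of_range("invalid UTF-8 continuation byte");
      }
    }
    ++count;
  }
  return count;
}

// "\xE3\x81\x82" is U+3042, a three-byte sequence; "\xC3\xA9" is U+00E9.
static_assert(utf8_strlen("abc") == 3, "ASCII: one byte per code point");
static_assert(utf8_strlen("\xE3\x81\x82") == 1, "three bytes, one code point");
static_assert(utf8_strlen("a\xC3\xA9" "b") == 3, "mixed widths");

// A constant expression can also size an array (illustrative names).
constexpr std::size_t kLen = utf8_strlen("UTF-8");
char buffer[kLen + 1] = {};
```

The literal `"a\xC3\xA9" "b"` is split in two so that the trailing `b` is not parsed as part of the hex escape.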

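One thing to watch out for: the argument must be a valid UTF-8 string.

That said, most compilers appear to report an error when the exception is reached during compile-time evaluation. This is expected, because a constant expression is not allowed to evaluate a throw expression, so initializing a constexpr variable from such a call is ill-formed. In my environment, both g++ (8.0.1) and clang++ (6.0.0) reported an error.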
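
To make the difference between the two cases concrete, here is a small sketch under the same assumptions as the article's function. Called at runtime with a malformed string, the function throws `std::out_of_range`; called in a constexpr initializer, the same input should stop the build. `counts_without_throwing` is a hypothetical helper introduced only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Compact copy of the article's function so this sketch is self-contained.
constexpr std::size_t utf8_strlen(const char* str) {
  std::size_t count = 0;
  while (*str != '\0') {
    const std::uint8_t lead = static_cast<std::uint8_t>(*str++);
    int trailing = (lead < 0x80) ? 0
                 : ((lead >> 5) == 0b110) ? 1
                 : ((lead >> 4) == 0b1110) ? 2
                 : ((lead >> 3) == 0b11110) ? 3
                 : throw std::out_of_range("invalid UTF-8 lead byte");
    for (; trailing > 0; --trailing) {
      if ((static_cast<std::uint8_t>(*str++) & 0xC0) != 0x80) {
        throw std::out_of_range("invalid UTF-8 continuation byte");
      }
    }
    ++count;
  }
  return count;
}

// Runtime call: malformed input surfaces as an exception.
// Hypothetical helper, not part of the article.
inline bool counts_without_throwing(const char* s) {
  try {
    (void)utf8_strlen(s);
    return true;
  } catch (const std::out_of_range&) {
    return false;
  }
}

// Compile-time call: uncommenting the next line should make the build
// fail, because evaluating it reaches a throw expression.
// constexpr std::size_t bad = utf8_strlen("\xFF");
```

The exact diagnostic text differs between compilers, so none is quoted here.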

Closing

If I feel like it, I will also write a version that takes the BOM into account.

Addendum (2018-08-10)

Version with BOM support

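#pragma once

#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Counts the code points in a null-terminated UTF-8 string.
// A leading BOM (EF BB BF) is skipped and not counted.
constexpr std::size_t utf8_strlen(const char* str) {
  std::size_t count = 0;

  // Check whether str starts with a BOM. The bytes are compared as
  // unsigned values so the test does not depend on the signedness of char;
  // && short-circuits, so a short string is never read past its terminator.
  if (static_cast<std::uint8_t>(str[0]) == 0xEF &&
      static_cast<std::uint8_t>(str[1]) == 0xBB &&
      static_cast<std::uint8_t>(str[2]) == 0xBF) {
    str += 3;
  }

  while (*str != '\0') {
    const std::uint8_t lead = static_cast<std::uint8_t>(*(str++));
    std::ptrdiff_t secondary_chars =
        (lead < 0x80) ? 0 :             // 0xxxxxxx
        ((lead >> 5) == 0b110) ? 1 :    // 110xxxxx
        ((lead >> 4) == 0b1110) ? 2 :   // 1110xxxx
        ((lead >> 3) == 0b11110) ? 3 :  // 11110xxx
        throw std::out_of_range("invalid UTF-8 sequence was detected.");

    while (secondary_chars) {
      // Continuation bytes must match 10xxxxxx.
      if ((static_cast<std::uint8_t>(*(str++)) & 0xC0) != 0x80) {
        throw std::out_of_range("invalid UTF-8 sequence was detected.");
      }

      --secondary_chars;
    }

    ++count;
  }

  return count;
}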
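
As a quick check of the BOM handling, the sketch below repeats a compact copy of the BOM-aware function and asserts a few cases at compile time. The test strings are my own, and the BOM is written as the escape sequence `"\xEF\xBB\xBF"` so the example does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Compact copy of the BOM-aware function so this sketch is self-contained.
constexpr std::size_t utf8_strlen(const char* str) {
  std::size_t count = 0;
  // Skip a leading BOM (EF BB BF); && stops at the first mismatch, so
  // strings shorter than three bytes are not read past the terminator.
  if (static_cast<std::uint8_t>(str[0]) == 0xEF &&
      static_cast<std::uint8_t>(str[1]) == 0xBB &&
      static_cast<std::uint8_t>(str[2]) == 0xBF) {
    str += 3;
  }
  while (*str != '\0') {
    const std::uint8_t lead = static_cast<std::uint8_t>(*str++);
    int trailing = (lead < 0x80) ? 0
                 : ((lead >> 5) == 0b110) ? 1
                 : ((lead >> 4) == 0b1110) ? 2
                 : ((lead >> 3) == 0b11110) ? 3
                 : throw std::out_of_range("invalid UTF-8 lead byte");
    for (; trailing > 0; --trailing) {
      if ((static_cast<std::uint8_t>(*str++) & 0xC0) != 0x80) {
        throw std::out_of_range("invalid UTF-8 continuation byte");
      }
    }
    ++count;
  }
  return count;
}

static_assert(utf8_strlen("\xEF\xBB\xBF" "abc") == 3, "BOM is not counted");
static_assert(utf8_strlen("abc") == 3, "no BOM: unchanged result");
static_assert(utf8_strlen("\xEF\xBB\xBF") == 0, "BOM only");
static_assert(utf8_strlen("") == 0, "empty string");
static_assert(utf8_strlen("a") == 1, "shorter than a BOM");
```

Only a BOM at the very start is skipped; the same three bytes later in the string are counted as one code point (U+FEFF).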
