Reading the Avro Specification #Kafka

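Summary

These are notes from working through the Apache Avro™ 1.10.2 Specification to decode an Avro schema that CDC Replication Engine for Kafka, a tooling product for Kafka, registered automatically, after something in that schema caught my attention.
To be upfront: I learned how to read an Avro schema, but the question that started all this was not fully resolved.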

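Background

I am running a verification project that replicates an IMS hierarchical database on a mainframe into Kafka.
Changes are captured from IMS with Infosphere Classic CDC for z/OS (CCDC below) and written to Kafka with CDC Replication Engine for Kafka (CDC for Kafka below).

When you give CCDC a logical view of the DB records, that is, their data structure, in the form of metadata, CDC for Kafka automatically creates the target topic and registers the data structure corresponding to the metadata as a schema in Avro format, which is convenient.
I was testing whether the whole flow, from metadata generation through Avro schema registration to replication, works for each of the common data types, and one thing caught my attention along the way.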

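I tested replication of binary data at 1, 2, and 4 bytes. A value that was 0xf0 at the source (the IMS being replicated) showed up as d when I read it on the Kafka side with avro-consume. Likewise, 0xfff0 became yd and 0xfffffff0 became yyyd.

It looks as though 0xff becomes y and 0xf0 becomes d, but y is not 15 positions after d (0xff - 0xf0 is 15, while ASCII y is 0x79 and d is 0x64, 21 apart), so I could not see why this happens.
So I decided to sit down and properly read the Avro schema registration that I had been leaving entirely to CDC for Kafka.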

Avro schema under investigation

Avro schema

{
    "type": "record",
    "name": "RTSEG2",
    "namespace": "value.SOURCEDB.IMS.HICKA1",
    "fields": [
        {
        "name": "RT_KEY",
        "type": [
            {
            "type": "string",
            "logicalType": "CHARACTER",
            "dbColumnName": "RT_KEY",
            "length": 6
            },
            "null"
        ],
        "doc": "",
        "default": ""
        },
        {
        "name": "RT_BIT8",
        "type": [
            {
            "type": "bytes",
            "logicalType": "BINARY",
            "dbColumnName": "RT_BIT8",
            "length": 1
            },
            "null"
        ],
        "doc": "",
        "default": ""
        },
        {
        "name": "RT_BIT16",
        "type": [
            {
            "type": "bytes",
            "logicalType": "BINARY",
            "dbColumnName": "RT_BIT16",
            "length": 2
            },
            "null"
        ],
        "doc": "",
        "default": ""
        },
        {
        "name": "RT_BIT32",
        "type": [
            {
            "type": "bytes",
            "logicalType": "BINARY",
            "dbColumnName": "RT_BIT32",
            "length": 4
            },
            "null"
        ],
        "doc": "",
        "default": ""
        }
    ]
}

Decoding the Avro format

The Avro schema of the topic replicated in this test begins as follows.

Avro schema, opening lines (repeated)

{
    "type": "record",
    "name": "RTSEG2",
    "namespace": "value.SOURCEDB.IMS.HICKA1",
    "fields": [

The part to note is "type": "record". In the terminology of the Avro specification, this makes the schema one of the Complex Types,

Complex Types

Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed.

Records

Records use the type name "record" and support three attributes:
...

namely a record. The description of the Records type says

fields: a JSON array, listing fields (required). Each field is a JSON object with the following attributes:

so the field definitions sent in this test should be read as JSON objects listed inside the brackets that follow "fields":.
The description of fields continues with

name: a JSON string providing the name of the field (required), and
doc: a JSON string describing this field for users (optional).
type: a schema, as defined above
default: A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time. Permitted values depend on the field's schema type, according to the table below. Default values for union fields correspond to the first schema in the union. Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255. Avro encodes a field even if its value is equal to its default.

Those are the attributes that matter here.

Avro schema, binary field (repeated)

        {
        "name": "RT_BIT8",
        "type": [
            {
            "type": "bytes",
            "logicalType": "BINARY",
            "dbColumnName": "RT_BIT8",
            "length": 1
            },
            "null"
        ],
        "doc": "",
        "default": ""
        },

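name is "RT_BIT8", exactly the field name given in the metadata on the CCDC side. doc and default are both present but set to empty strings.

type holds two values enclosed in square brackets, [{...},"null"]. This is the form the specification calls a Union.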

Unions

Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.

Of the two values, the first is written as a JSON object. The Avro specification lists the following ways to declare a schema.

Schema Declaration

A Schema is represented in JSON by one of:
- A JSON string, naming a defined type.
- A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
- A JSON array, representing a union of embedded types.

The first value is presumably written in the second form, "A JSON object". The typeName it specifies is bytes.
As a type, bytes is described as

Primitive Types

The set of primitive type names is:
...
- bytes: sequence of 8-bit unsigned bytes

So bytes is simply a sequence of 8-bit unsigned bytes.
The "logicalType": "BINARY" and "length": 1 entries that follow it appear to be the attributes.
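
The structure read so far can be checked mechanically. The following is a minimal sketch using only Python's standard json module; describe_schema is a hypothetical helper name of mine, not part of any Avro library, and it only distinguishes the three JSON forms quoted from the specification above.

```python
import json

# The RT_BIT8 field definition, copied from the schema above.
FIELD_JSON = """
{
  "name": "RT_BIT8",
  "type": [
    {"type": "bytes", "logicalType": "BINARY", "dbColumnName": "RT_BIT8", "length": 1},
    "null"
  ],
  "doc": "",
  "default": ""
}
"""


def describe_schema(schema):
    """Classify a schema by the three JSON forms the specification allows."""
    if isinstance(schema, list):   # JSON array: a union of the listed schemas
        return "union of [" + ", ".join(describe_schema(s) for s in schema) + "]"
    if isinstance(schema, dict):   # JSON object: {"type": typeName, ...attributes...}
        return schema["type"]
    return schema                  # JSON string: names a defined type


field = json.loads(FIELD_JSON)
summary = describe_schema(field["type"])
print(field["name"], "->", summary)
```

Everything in the inner object other than "type" (logicalType, dbColumnName, length) is ignored by this helper, which mirrors the rule that extra attributes are metadata only.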

About logicalType, the specification says

Logical Types

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.
A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute. Language implementations may choose to represent logical types with an appropriate native type, although this is not required.
Language implementations must ignore unknown logical types when reading, and should use the underlying Avro type. If a logical type is invalid, for example a decimal with scale greater than its precision, then implementations should ignore the logical type and use the underlying Avro type.

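In short, this says:
- A logical type is a primitive type or complex type with extra attributes attached.
- Its name is a string naming one of the logical types listed later in that section.
- It is always encoded in the same way as its underlying Avro type.
- An unknown or invalid logicalType is ignored and the underlying type is used.
However, BINARY, the logicalType that CDC for Kafka generated, does not appear in the list in the specification.
Whether this logicalType counts as valid or not, it seems safe to assume the value is encoded as bytes, its underlying primitive type.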
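
The fallback rule can be sketched as below. This is only an illustration: effective_type is a hypothetical helper, and the set of names is how I read the list in the 1.10.2 specification, so treat it as an assumption and check it against the specification before relying on it.

```python
# Logical type names as I read them from the 1.10.2 specification (assumption).
KNOWN_LOGICAL_TYPES = {
    "decimal", "uuid", "date",
    "time-millis", "time-micros",
    "timestamp-millis", "timestamp-micros",
    "local-timestamp-millis", "local-timestamp-micros",
    "duration",
}


def effective_type(schema: dict) -> str:
    """Return how a reader following the spec would treat this schema object."""
    logical = schema.get("logicalType")
    if logical in KNOWN_LOGICAL_TYPES:
        return f'{schema["type"]} (logical: {logical})'
    # Unknown logical types are ignored; only the underlying type remains.
    return schema["type"]


rt_bit8 = {"type": "bytes", "logicalType": "BINARY", "dbColumnName": "RT_BIT8", "length": 1}
print(effective_type(rt_bit8))
```

For the RT_BIT8 definition this falls through to plain bytes, which is the reading adopted in the rest of these notes.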

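Decoding the encoding

Having confirmed that the binary fields are handled as the bytes type, I read the description of the encoding to chase down the mystery of d and yd.

The encoding of Complex Types is described as follows.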

Complex Types

Complex types are encoded in binary as follows:

Records

A record is encoded by encoding the values of its fields in the order that they are declared. In other words, a record is encoded as just the concatenation of the encodings of its fields. Field values are encoded per their schema.
For example, the record schema
          {
          "type": "record",
          "name": "test",
          "fields" : [
          {"name": "a", "type": "long"},
          {"name": "b", "type": "string"}
          ]
          }
An instance of this record whose a field has value 27 (encoded as hex 36) and whose b field has value "foo" (encoded as hex bytes 06 66 6f 6f), would be encoded simply as the concatenation of these, namely the hex byte sequence:
36 06 66 6f 6f

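So each field is encoded according to its type and the results are simply written one after another. In other words, the "length": 1 in the JSON schema does not seem to be used for decoding, which fits the rule quoted earlier that attributes not defined in the specification must not affect the format of serialized data.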
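
To convince myself of the concatenation rule, here is a minimal sketch of an encoder for the specification's example record. It is not the Avro library; encode_long and encode_string are names I made up, and encode_long implements the variable-length zig-zag coding that the specification prescribes for int and long.

```python
def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: zig-zag, then 7 bits per byte, low group first."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: 0, -1, 1, -2 ... -> 0, 1, 2, 3 ...
    out = bytearray()
    while True:
        group = z & 0x7F
        z >>= 7
        if z:
            out.append(group | 0x80)  # more bytes follow
        else:
            out.append(group)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode an Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Record {"a": 27, "b": "foo"}: just the field encodings, in declaration order.
record = encode_long(27) + encode_string("foo")
print(record.hex(" "))
```

This reproduces the byte sequence 36 06 66 6f 6f from the specification, with no field names, separators, or declared lengths in the output.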

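Types without a fixed length, such as bytes and string, carry a length prefix at the front. The prefix is an Avro long, which is written with variable-length zig-zag coding, so it is a single byte only when the length is small.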

bytes are encoded as a long followed by that many bytes of data.

a string is encoded as a long followed by that many bytes of UTF-8 encoded character data.

For example, the three-character string "foo" would be encoded as the long value 3 (encoded as hex 06) followed by the UTF-8 encoding of 'f', 'o', and 'o' (the hex bytes 66 6f 6f):
06 66 6f 6f

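The 3-byte string "foo" is 66 6f 6f in UTF-8, and the long value 3 is prepended as 06, giving 06 66 6f 6f. The prefix is called a long, yet it is a single byte in every example in the specification. That is because int and long values are written with variable-length zig-zag coding: 3 maps to 6, which fits in one byte.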

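In the earlier example of a long field followed by a string field, the data 36 06 66 6f 6f would presumably be decoded like this: the first field is a long, so read one variable-length value (here the single byte 36, which is 27); the next field is a string, so read its length prefix, find that the data is 3 bytes long, and decode the following 3 bytes as the string field.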
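
The decoding walk just described can be sketched in a few lines. Again this is my own illustration rather than the Avro library; read_long and read_string are hypothetical names, and each returns the decoded value together with the position of the next unread byte.

```python
def read_long(buf: bytes, pos: int):
    """Decode one zig-zag varint starting at pos; return (value, next_pos)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:              # high bit clear: this was the last byte
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos   # undo zig-zag


def read_string(buf: bytes, pos: int):
    """Decode a length-prefixed UTF-8 string; return (value, next_pos)."""
    n, pos = read_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n


data = bytes.fromhex("36 06 66 6f 6f")
a, pos = read_long(data, 0)      # field a: long
b, pos = read_string(data, pos)  # field b: string
print(a, b, pos)
```

The reader needs the schema to know which function to call next; nothing in the bytes themselves says where one field ends and the next begins.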

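Decoding the actual data

Now that I know the encoding and decoding rules, I can interpret the raw data the same way avro-consumer does.
I pulled the raw data with the plain console-consumer (not the Avro one) and piped it through xxd to get a hex dump.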

00000000: 0000 0000 0e00 0c30 3030 3030 3100 1454  .......000001..T
00000010: 6869 7349 7352 4f4f 5400 3cef bdb1 efbd  hisIsROOT.<.....
(omitted)
00000110: 6421 0002 f000 04ff f000 08ff ffff f00a  d!..............

The middle is omitted because it contains a fair amount of data from tests of other data types.
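
The first bytes of the dump can be read with the same rules. The sketch below rests on two assumptions that I have not confirmed for this product: that the message starts with the Confluent Schema Registry wire format (one magic byte 0 followed by a 4-byte big-endian schema ID), and that the first two fields are nullable strings declared like RT_KEY, with the string branch first in the union. read_nullable_string is a hypothetical helper.

```python
import struct


def read_long(buf: bytes, pos: int):
    """Decode one zig-zag varint starting at pos; return (value, next_pos)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def read_nullable_string(buf: bytes, pos: int):
    """Read a [string, "null"] union: branch index first, then the value."""
    branch, pos = read_long(buf, pos)
    if branch != 0:                   # branch 1 is "null", which has no payload
        return None, pos
    n, pos = read_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n


# First 25 bytes of the xxd output above.
head = bytes.fromhex("0000 0000 0e00 0c30 3030 3030 3100 1454 6869 7349 7352 4f4f 54")

magic, schema_id = struct.unpack(">bI", head[:5])  # assumed wire-format header
key, pos = read_nullable_string(head, 5)
name, pos = read_nullable_string(head, pos)
print(magic, schema_id, key, name)
```

Under these assumptions the header reads as magic 0 and schema ID 14, followed by the 6-character key and a 10-character string, which at least matches the text visible in the right-hand column of the dump.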

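The parts that should correspond to the 1-, 2-, and 4-byte binary data are indeed there: 0002 f000 04ff f000 08ff ffff f00a.
The raw data holds f0, fff0, and fffffff0, each preceded by a length prefix of 02, 04, and 08, which are the zig-zag encodings of 1, 2, and 4. Beautifully correct. Each field is also preceded by 00. Since every field's type is a union, this is most likely the union branch index: a union value is encoded by first writing the zero-based position of its schema within the union, and 0 selects the first branch, here bytes.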
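
As a check, the three binary fields can be decoded from the dump by hand-rolled code. This is a sketch under the assumption that each field is a [bytes, "null"] union as in the schema above; read_long and read_nullable_bytes are my own helper names, not Avro library functions.

```python
def read_long(buf: bytes, pos: int):
    """Decode one zig-zag varint starting at pos; return (value, next_pos)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def read_nullable_bytes(buf: bytes, pos: int):
    """Read a [bytes, "null"] union: branch index, then length, then payload."""
    branch, pos = read_long(buf, pos)
    if branch != 0:                   # branch 1 is "null", which has no payload
        return None, pos
    n, pos = read_long(buf, pos)
    return buf[pos:pos + n], pos + n


# RT_BIT8, RT_BIT16, RT_BIT32 as they appear at offset 0x112 of the dump.
tail = bytes.fromhex("00 02 f0 00 04 ff f0 00 08 ff ff ff f0")

pos = 0
values = []
for field_name in ("RT_BIT8", "RT_BIT16", "RT_BIT32"):
    value, pos = read_nullable_bytes(tail, pos)
    values.append(value)
    print(field_name, value.hex())
```

All 13 bytes are consumed and the three payloads come out as f0, fff0, and fffffff0, the same values as on the IMS side.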

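So the data is correct at the raw level, and the garbling into d and y seems to happen when avro-consumer prints it. I got that far, but why the output is d and y in particular remains a mystery.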
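
One possible lead, which I have not verified in this environment: the specification passage quoted earlier maps Unicode code points 0-255 to byte values 0-255 for bytes defaults, and as far as I know Avro's JSON encoding represents bytes values as strings in the same way. If the consumer prints values through that encoding, each byte becomes the character with the same code point. The sketch below only shows that mapping; avro_json_bytes is a hypothetical name, and ISO-8859-1 is used because it maps bytes 0-255 onto code points U+0000-U+00FF one to one.

```python
def avro_json_bytes(value: bytes) -> str:
    """Map each byte to the Unicode code point with the same number."""
    return value.decode("iso-8859-1")


samples = (b"\xf0", b"\xff\xf0", b"\xff\xff\xff\xf0")
for raw in samples:
    text = avro_json_bytes(raw)
    print(raw.hex(), "->", " ".join(f"U+{ord(c):04X}" for c in text))
```

U+00F0 is the letter eth (ð) and U+00FF is y with diaeresis (ÿ). A terminal or character-set conversion that cannot show them and falls back to a similar-looking ASCII letter might print d and y, but that last step is a guess on my part.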

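Still, this showed that CDC for Kafka writes the data correctly, which met the goal of the product verification, so I stopped here for now.

Change history

21-11-04
- @tomotagwork pointed out that each field's type is probably specified as a Union, and I have reflected that in the text.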