4
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

Glueの使い方的な㊻(update-table)

Last updated at Posted at 2020-06-10

メモ書き

get-table

テーブルtmp_logsの情報を get-table API で取得

$ aws glue get-table --database-name default --name tmp_logs --region ap-northeast-1
{
    "Table": {
        "Name": "tmp_logs",
        "DatabaseName": "default",
        "Owner": "owner",
        "CreateTime": "2020-04-28T15:18:26+09:00",
        "UpdateTime": "2020-04-28T15:18:26+09:00",
        "LastAccessTime": "2020-04-28T15:18:26+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "host",
                    "Type": "string"
                },
                {
                    "Name": "user",
                    "Type": "string"
                },
                {
                    "Name": "method",
                    "Type": "string"
                },
                {
                    "Name": "path",
                    "Type": "string"
                },
                {
                    "Name": "code",
                    "Type": "int"
                },
                {
                    "Name": "size",
                    "Type": "int"
                },
                {
                    "Name": "referer",
                    "Type": "string"
                },
                {
                    "Name": "agent",
                    "Type": "string"
                },
                {
                    "Name": "time",
                    "Type": "string"
                }
            ],
            "Location": "s3://20200204sample/logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": true,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
                "Parameters": {
                    "paths": "agent,code,host,method,path,referer,size,time,user"
                }
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "CrawlerSchemaDeserializerVersion": "1.0",
                "CrawlerSchemaSerializerVersion": "1.0",
                "UPDATED_BY_CRAWLER": "tmp",
                "averageRecordSize": "152",
                "classification": "json",
                "compressionType": "gzip",
                "objectCount": "7",
                "recordCount": "19",
                "sizeKey": "1148",
                "typeOfData": "file"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "CrawlerSchemaDeserializerVersion": "1.0",
            "CrawlerSchemaSerializerVersion": "1.0",
            "UPDATED_BY_CRAWLER": "tmp",
            "averageRecordSize": "152",
            "classification": "json",
            "compressionType": "gzip",
            "objectCount": "7",
            "recordCount": "19",
            "sizeKey": "1148",
            "typeOfData": "file"
        },
        "CreatedBy": "arn:aws:sts::xxxxxxxxxxxx:assumed-role/test-glue/AWS-Crawler",
        "IsRegisteredWithLakeFormation": false
    }
}

TableInput object

以下のドキュメントの --table-input の TableInput object の JSONファイルの構文に合わせて、上記で得た情報を転記する

転記した上で"Location"を"s3://20200204sample/logs123/"に修正する。

$ cat tmp2.json 
{
  "Name": "tmp_logs",
  "Owner": "owner",
  "Retention": 0,
  "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "host",
                    "Type": "string"
                },
                {
                    "Name": "user",
                    "Type": "string"
                },
                {
                    "Name": "method",
                    "Type": "string"
                },
                {
                    "Name": "path",
                    "Type": "string"
                },
                {
                    "Name": "code",
                    "Type": "int"
                },
                {
                    "Name": "size",
                    "Type": "int"
                },
                {
                    "Name": "referer",
                    "Type": "string"
                },
                {
                    "Name": "agent",
                    "Type": "string"
                },
                {
                    "Name": "time",
                    "Type": "string"
                }
            ],
    "Location": "s3://20200204sample/logs123/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "Compressed": true,
    "NumberOfBuckets": -1,
    "SerdeInfo": {
      "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
      "Parameters": {"paths": "agent,code,host,method,path,referer,size,time,user"
        }
    },
    "BucketColumns": [],
    "SortColumns": [],
    "Parameters": {
        "CrawlerSchemaDeserializerVersion": "1.0",
        "CrawlerSchemaSerializerVersion": "1.0",
        "UPDATED_BY_CRAWLER": "tmp",
        "averageRecordSize": "152",
        "classification": "json",
        "compressionType": "gzip",
        "objectCount": "7",
        "recordCount": "19",
        "sizeKey": "1148",
        "typeOfData": "file"
    },
    "StoredAsSubDirectories": false
  },
  "PartitionKeys": [],
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {
      "CrawlerSchemaDeserializerVersion": "1.0",
      "CrawlerSchemaSerializerVersion": "1.0",
      "UPDATED_BY_CRAWLER": "tmp",
      "averageRecordSize": "152",
      "classification": "json",
      "compressionType": "gzip",
      "objectCount": "7",
      "recordCount": "19",
      "sizeKey": "1148",
      "typeOfData": "file"
  }
}

update-table

注意点としては、TableInput Object(↑のJSONファイル)の内容に空白の項目があると空白でテーブルがアップデートされちゃいます。
つまり今回修正したい"Location"を"s3://20200204sample/logs123/"だけの内容のJSONファイルでupdate-tableすると、Location以外は空白のテーブルにアップデートされちゃいます。

$ aws glue update-table --database-name default --table-input file://tmp2.json --region ap-northeast-1

update後に get-table

$ aws glue get-table --database-name default --name tmp_logs --region ap-northeast-1
{
    "Table": {
        "Name": "tmp_logs",
        "DatabaseName": "default",
        "Owner": "owner",
        "CreateTime": "2020-04-28T15:18:26+09:00",
        "UpdateTime": "2020-06-10T08:47:52+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "host",
                    "Type": "string"
                },
                {
                    "Name": "user",
                    "Type": "string"
                },
                {
                    "Name": "method",
                    "Type": "string"
                },
                {
                    "Name": "path",
                    "Type": "string"
                },
                {
                    "Name": "code",
                    "Type": "int"
                },
                {
                    "Name": "size",
                    "Type": "int"
                },
                {
                    "Name": "referer",
                    "Type": "string"
                },
                {
                    "Name": "agent",
                    "Type": "string"
                },
                {
                    "Name": "time",
                    "Type": "string"
                }
            ],
            "Location": "s3://20200204sample/logs123/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": true,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
                "Parameters": {
                    "paths": "agent,code,host,method,path,referer,size,time,user"
                }
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "CrawlerSchemaDeserializerVersion": "1.0",
                "CrawlerSchemaSerializerVersion": "1.0",
                "UPDATED_BY_CRAWLER": "tmp",
                "averageRecordSize": "152",
                "classification": "json",
                "compressionType": "gzip",
                "objectCount": "7",
                "recordCount": "19",
                "sizeKey": "1148",
                "typeOfData": "file"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "CrawlerSchemaDeserializerVersion": "1.0",
            "CrawlerSchemaSerializerVersion": "1.0",
            "UPDATED_BY_CRAWLER": "tmp",
            "averageRecordSize": "152",
            "classification": "json",
            "compressionType": "gzip",
            "objectCount": "7",
            "recordCount": "19",
            "sizeKey": "1148",
            "typeOfData": "file"
        },
        "IsRegisteredWithLakeFormation": false
    }
}

前のテーブルとUpdate後のテーブルの差分

get-tableした結果をエディタでdiff表示

UpdateTime、LastAccessTime はそれぞれ日付が異なるので差分が出ている

スクリーンショット 0002-06-10 8.56.57.png

修正したLocationは異なるので差分が出ている

スクリーンショット 0002-06-10 8.57.07.png

古いテーブルはGlue Crawlerで行ったためCreatedByの表示があり、新しいテーブルはそれがなくなっており差分が出ている

スクリーンショット 0002-06-10 10.04.43.png

こちらも是非

Glueの使い方まとめ
https://qiita.com/pioho07/items/32f76a16cbf49f9f712f

4
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
4
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?