AWS Summit - Serverless Analytics with AWS Glue and AWS Lake Formation

Posted at 2019-06-12

AWS Glue overview

  • Managed, serverless ETL service
  • Aimed at developers and data scientists
  • 35+ features
  • Data cataloging
    • Automatic crawling
    • Apache Hive Metastore compatible
    • Integration with analytics services
  • Serverless engines
    • Apache Spark
    • Python shell
    • Batch jobs
    • Interactive (?)
  • Auto scaling
    • Schedule

Data discovery

  • Performance
    • 90 million per day

Science

  • Apache Spark
    • No provisioning or management required
    • Auto Scaling
    • On demand
  • Apache Spark Core: RDD
  • DataFrame
    • Core data structure of Spark SQL
    • Well suited to SQL-like analysis
  • DynamicFrame
    • Schema is recorded per record; no upfront schema required
    • Runs many flows in a single pass
    • Glue Parquet Writer
    • Standard Parquet writer
    • Performance
    • Configuration: 10 DPU, Apache Spark 2.
    • Workload
      • JSON -> Parquet
    • DynamicFrame: 78 s
    • DataFrame: 195 s
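
The two timings above imply a ratio that is easy to check by hand. A minimal sketch, using only the 78 s and 195 s figures from these notes (the variable names are illustrative):

```python
# Timings noted above for the JSON -> Parquet workload on 10 DPUs.
dynamic_frame_seconds = 78
data_frame_seconds = 195

# How many times faster the DynamicFrame run was.
speedup = data_frame_seconds / dynamic_frame_seconds

# Share of wall-clock time saved relative to the DataFrame run.
time_saved = 1 - dynamic_frame_seconds / data_frame_seconds

print(f"speedup: {speedup:.1f}x, time saved: {time_saved:.0%}")
# prints "speedup: 2.5x, time saved: 60%"
```

This is a single measurement on one workload, so the ratio should not be read as a general rule.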

AWS Glue execution model

  • Driver -> multiple executors
  • Continuous logging
  • Apache Spark log messages can be filtered out
  • Progress bar
  • Job metrics
    • Based on Apache Spark metrics
    • Driver / executors
      • 30 s summary
      • Real time in CloudWatch
  • Memory monitoring
    • DataFrame: many small files produce too many tasks and use too much memory
    • DynamicFrame: automatically groups small files into tasks
    • Worker types
      • Default
      • G.1X
      • G.2X
  • Python shell
    • SQL-based analysis
    • Medium-sized ML
    • Python 2.7 / 3.6 supported
      • boto3, awscli, numpy, scipy, pandas, ... preinstalled
    • Spin-up: under 20 s
    • Network addresses supported
    • Size: 1 DPU or 1/16 DPU
  • Python shell filtering
    • Cost: $0.6
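
The memory-monitoring note about small files can be made concrete with a toy model. The sketch below is not Glue's implementation; it only illustrates how packing many small files into size-bounded groups cuts the number of tasks. The function name `group_files`, the 1 MiB target, and the file sizes are all assumptions made for illustration.

```python
from typing import List


def group_files(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily pack file sizes into groups of at most target_bytes each.

    A file larger than target_bytes ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sizes:
        # Start a new group when adding this file would exceed the target.
        if current and current_total + size > target_bytes:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 1,000 files of 10 KiB each.
file_sizes = [10 * 1024] * 1000
groups = group_files(file_sizes, target_bytes=1024 * 1024)

print(f"one task per file: {len(file_sizes)} tasks")
print(f"grouped:           {len(groups)} tasks")
```

In Glue itself, this behavior is controlled through the S3 connection options `groupFiles` and `groupSize` rather than through user code like the above.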

Auto scaling

  • Event based
    • Lambda
  • Scheduled events
  • Entities
    • Glue
    • Job
    • Trigger
  • Events
    • Schedule
    • Event
    • External
  • Control
    • ...
    • Workflow feature
  • Authoring DAGs
  • Workflow rerun
  • Monitoring
    • Updates
  • Network
    • Reverse DNS support
    • VPC endpoint support for Glue
  • Job, trigger -> resource tagging
  • Notifications
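
A workflow of jobs and triggers forms a directed acyclic graph (DAG), and a valid run order is any topological order of that graph. The sketch below shows the idea in plain Python. It does not call the Glue API, and the job names (`crawl_raw`, `json_to_parquet`, `dedupe`, `build_report`) are hypothetical.

```python
from collections import deque
from typing import Dict, List


def run_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return one valid execution order using Kahn's algorithm.

    depends_on maps each job to the jobs that must finish before it.
    Every dependency must itself be a key of depends_on.
    Raises ValueError if the graph contains a cycle.
    """
    # Number of unfinished prerequisites per job.
    remaining = {job: len(deps) for job, deps in depends_on.items()}
    # Reverse edges: which jobs are unblocked when a given job finishes.
    dependents: Dict[str, List[str]] = {job: [] for job in depends_on}
    for job, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(job)

    ready = deque(sorted(job for job, count in remaining.items() if count == 0))
    order: List[str] = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(depends_on):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical workflow: two jobs fan out from a crawl, then join in a report.
workflow = {
    "crawl_raw": [],
    "json_to_parquet": ["crawl_raw"],
    "dedupe": ["crawl_raw"],
    "build_report": ["json_to_parquet", "dedupe"],
}
order = run_order(workflow)
print(" -> ".join(order))
```

Rejecting cycles up front mirrors the "DAG" constraint: a job that transitively waits on itself could never start.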

AWS Lake Formation

  • Build and manage a secure data lake
  • Sample of steps required
    • Find resources
    • Create S3 locations
    • Configure access policies
    • Map tables to Amazon S3 locations
    • ETL jobs
    • Create metadata access policies
    • Configure access from analytics services
    • Rinse and repeat for other
    • Manual | Error | ??
  • Collecting and cleansing data
  • Store data securely
  • Security
    • SQL-style grant/revoke permissions
    • EMR-Spark, Athena, Redshift, Glue
  • Collection
    • ML transforms for fuzzy record matching
    • Blueprints: CloudTrail / ALB
  • Data discovery
  • Wraps Glue
  • Compliance
    • HIPAA BAA
    • ISO
    • PCI
    • ???
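
The SQL-style grant/revoke model noted under Security comes down to a principal, a resource, and a list of permissions. As a minimal sketch, the helper below only builds keyword arguments in the shape that the boto3 `lakeformation` client accepts for `grant_permissions` and `revoke_permissions`; it makes no AWS call. The helper name `table_permission_request`, the role ARN, and the database and table names are hypothetical.

```python
from typing import Any, Dict, List


def table_permission_request(
    principal_arn: str, database: str, table: str, permissions: List[str]
) -> Dict[str, Any]:
    """Build kwargs for a table-level grant or revoke request."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Roughly: GRANT SELECT ON sales.orders TO the analyst role.
request = table_permission_request(
    principal_arn="arn:aws:iam::123456789012:role/analyst",  # hypothetical
    database="sales",  # hypothetical
    table="orders",  # hypothetical
    permissions=["SELECT"],
)
print(request)

# With credentials configured, the same kwargs could be passed on, e.g.:
#   import boto3
#   client = boto3.client("lakeformation")
#   client.grant_permissions(**request)
#   client.revoke_permissions(**request)
```

Keeping the request as plain data makes it easy to review or log a permission change before applying it.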