AWS Certified Big Data - Specialty: Study Notes

Posted at 2019-03-19

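Purpose

This article summarizes the exam scope and key exam points for the AWS Certified Big Data - Specialty exam.
The linked reference pages are in English, but you can switch the language at the top right of each page.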

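Exam overview

  • Format   : Multiple-choice and multiple-response questions
  • Duration : 170 minutes
  • Languages: English, Japanese, Korean, Simplified Chinese
  • Fee      : 30,000 yen (Japanese version, excluding tax)

The questions are often translated into awkward Japanese, so I recommend taking the exam in English. (This was the case for the Professional exam.)
Even if you register for the Japanese version, you can switch between Japanese and English at the top of the page during the exam.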

Services covered

  • Athena
  • AWS Glue
  • AWS Data Pipeline
  • DynamoDB
  • EMR
  • Elasticsearch Service
  • Kinesis Data Analytics
  • Kinesis Data Firehose
  • Kinesis Data Streams
  • Kinesis Video Streams
  • ML
    • Amazon Comprehend
    • Amazon Rekognition
    • Amazon Polly
    • Amazon SageMaker
    • Amazon Transcribe
  • Neptune
  • QuickSight
  • IoT
  • Redshift
  • S3

Glossary

  • HDFS: Hadoop Distributed File System
  • ML: Machine Learning
  • KMS: Key Management Service
  • CMK: Customer Master Key
  • ETL: Extract, Transform, Load
  • TCO: Total Cost of Ownership
  • SLA: Service Level Agreement

Athena

Query optimization

  • Partition the data in S3
  • Use Apache Parquet Format
  • Use ORC Format

  • Work in progress
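As a rough illustration of the partitioning advice above, the sketch below builds Hive-style `key=value` object keys, a layout that lets Athena skip partitions excluded by a query's `WHERE` clause. The function name `partitioned_key`, the `logs` prefix, and the `year/month/day` scheme are assumptions chosen for the example, not something prescribed by the exam guide.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key (year=/month=/day=).

    Hypothetical helper: the partition columns are an assumption for this sketch.
    """
    return (
        f"{prefix.strip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


key = partitioned_key("logs", date(2019, 3, 19), "part-0000.parquet")
print(key)  # logs/year=2019/month=03/day=19/part-0000.parquet
```

Writing the objects themselves in a columnar format such as Parquet or ORC is a separate step that this sketch does not cover.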

AWS Glue

  • Work in progress

AWS Data Pipeline

DynamoDB

Elasticsearch Service

An open-source text search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. It also integrates seamlessly with web applications.

Features

Encryption of data at rest (KMS)

  • Encrypted
    • Indices
    • Automated snapshots
    • Elasticsearch logs
    • Swap files
    • All other data in the application directory
  • Not encrypted
    • Manual snapshots
    • Slow logs and error logs

EMR

Apache Hive

  • An open-source data warehouse and analytics package that runs on top of a Hadoop cluster.
  • Hive scripts use a SQL-like language called HiveQL (Hive Query Language) that abstracts programming models and supports typical data warehouse interactions.
  • Hive lets you avoid the complexity of writing MapReduce programs in a lower-level language such as Java.
  • Hive extends the SQL paradigm by including serialization formats. You can also customize query processing by creating a table schema that matches your data, without touching the data itself. In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

Apache Hadoop

Apache HBase

  • open source, non-relational, distributed database developed as part of the Hadoop project
  • runs on top of HDFS to provide non-relational database capabilities for the Hadoop ecosystem

Apache HCatalog

  • a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications
  • has a REST interface and command line client that allows you to create tables or do other operations

Apache Phoenix

Apache ZooKeeper

EMR Notebook

Work in progress

Kinesis Data Analytics

Windowed Queries

  • Tumbling window
    • Fixed-size windows that do not overlap, so each record is processed in exactly one window.
  • Sliding window
  • Stagger window
    • Suited for analyzing groups of data that arrive at inconsistent times. It is well suited for any time-series
      analytics use case, such as a set of related sales or log records.
  • Work in progress
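To make the tumbling window concrete: it assigns every record to exactly one fixed-size, non-overlapping window. The following is a minimal plain-Python sketch of that idea, not Kinesis Data Analytics SQL. The function name `tumbling_counts`, the ticker symbols, and the 60-second window are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, size):
    """Count events per non-overlapping window of `size` seconds.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Integer division maps each timestamp to exactly one window start.
        window_start = (ts // size) * size
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample data: (seconds since start, ticker symbol).
events = [(0, "AMZN"), (30, "AMZN"), (59, "MSFT"), (60, "AMZN"), (119, "AMZN")]
result = tumbling_counts(events, 60)
print(result)  # {(0, 'AMZN'): 2, (0, 'MSFT'): 1, (60, 'AMZN'): 2}
```

A sliding window would instead let one record contribute to several overlapping windows, which this sketch deliberately does not model.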

Kinesis Data Firehose

  • Work in progress

Kinesis Data Streams

PutRecords Limit

  • Up to 500 records per request
  • Each record can be up to 1 MB, with a limit of 5 MB for the entire request, including partition keys
  • Each shard supports writes of up to 1,000 records per second, up to a maximum of 1 MB per second
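The request-level limits above can be respected on the client side by grouping records before each call. The sketch below is one possible way to do that in plain Python, without calling the AWS SDK. The names `batch_records` and `record_size` are hypothetical, and treating the 1 MB limit as applying to the data payload alone (with partition keys counted only toward the 5 MB request total) is an assumption of this example.

```python
MAX_RECORDS_PER_REQUEST = 500
MAX_RECORD_BYTES = 1024 * 1024       # per-record limit (assumed: data payload only)
MAX_REQUEST_BYTES = 5 * 1024 * 1024  # per-request limit, partition keys included


def record_size(data: bytes, partition_key: str) -> int:
    """Bytes a record contributes to the request total (data + partition key)."""
    return len(data) + len(partition_key.encode("utf-8"))


def batch_records(records):
    """Yield lists of (data, partition_key) pairs that fit one PutRecords request."""
    batch, batch_bytes = [], 0
    for data, key in records:
        if len(data) > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the per-record limit")
        size = record_size(data, key)
        # Start a new batch when either the count or the byte budget is used up.
        if batch and (len(batch) == MAX_RECORDS_PER_REQUEST
                      or batch_bytes + size > MAX_REQUEST_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((data, key))
        batch_bytes += size
    if batch:
        yield batch


small = [(b"x" * 10, "k")] * 1200
print([len(b) for b in batch_records(small)])  # [500, 500, 200]
```

Each yielded batch could then be passed to a single PutRecords call. The per-shard throughput limits are a separate concern that batching alone does not address.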

SSE (Server-Side Encryption)

  • The AWS-managed master key for Kinesis (alias aws/kinesis)
  • A master key imported into the AWS KMS service
  • A user-specified AWS KMS CMK

KPL Key Concepts

  • Records
  • Batching
  • Aggregation
  • Collection

KPL Key Concepts: Batching

  • Aggregation – Storing multiple records within a single Kinesis Data Streams record
  • Collection – Using the API operation PutRecords to send multiple Kinesis Data Streams records to one or
    more shards in your Kinesis data stream.
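The aggregation idea, storing several small user records inside one stream record, can be sketched in a few lines. Note that this is a simplified stand-in: the real KPL uses its own protobuf-based format, whereas the hypothetical `aggregate` and `deaggregate` functions below use a plain 4-byte length prefix purely for illustration.

```python
import struct


def aggregate(user_records, max_bytes=1024 * 1024):
    """Pack user records (bytes) into as few blobs as possible.

    Each user record is framed as a 4-byte big-endian length followed by its
    bytes. This framing is an assumption of the sketch, not the KPL format.
    """
    blobs, current = [], bytearray()
    for rec in user_records:
        framed = struct.pack(">I", len(rec)) + rec
        if len(framed) > max_bytes:
            raise ValueError("user record too large to aggregate")
        # Close the current blob when the next record would not fit.
        if current and len(current) + len(framed) > max_bytes:
            blobs.append(bytes(current))
            current = bytearray()
        current += framed
    if current:
        blobs.append(bytes(current))
    return blobs


def deaggregate(blob):
    """Recover the original user records from one aggregated blob."""
    out, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        out.append(blob[offset:offset + length])
        offset += length
    return out
```

Aggregation reduces the number of stream records, which matters because the per-shard write limit is counted in records per second as well as bytes. Collection then sends the resulting stream records together in one PutRecords call.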

KCL

  • Work in progress

Kinesis Video Streams

  • Work in progress

Machine Learning (ML)

Types of ML Models

  • Binary Classification Model
    • Predicts a binary outcome (one of two possible classes)
  • Multiclass Classification Model
    • Generates predictions for multiple classes (predicts one of more than two outcomes)
  • Regression Model
    • Predicts a numeric value
  • JupyterHub
  • Work in progress
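One way to remember the three model types is by the shape of the target values. The heuristic below is a hedged sketch for study purposes only: the function `suggest_model_type` is hypothetical and is not how any AWS service selects a model type.

```python
from numbers import Number


def suggest_model_type(targets):
    """Guess a model type from example target values (study heuristic only)."""
    distinct = set(targets)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    if len(distinct) == 2:
        # Exactly two possible outcomes, e.g. spam / not spam.
        return "binary classification"
    if all(isinstance(t, Number) and not isinstance(t, bool) for t in distinct):
        # Many distinct numeric values, e.g. a price.
        return "regression"
    # More than two non-numeric categories.
    return "multiclass classification"


print(suggest_model_type(["spam", "ham", "spam"]))     # binary classification
print(suggest_model_type(["cat", "dog", "bird"]))      # multiclass classification
print(suggest_model_type([199.0, 250.5, 310.0, 120]))  # regression
```

The heuristic breaks down in edge cases, for example numeric class labels such as 1, 2, 3, so in practice the choice depends on what the target means rather than on its data type alone.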

Neptune

A graph database service that powers the following use cases.

  • Recommendation engines
  • Fraud detection
  • Knowledge graphs
  • Drug discovery
  • Network security

QuickSight

  • Work in progress

IoT

AWS Certified Big Data - Specialty: Study Notes - IoT

Redshift

Best Practices

High-Performance ETL Processing Practices

  • COPY data from multiple, evenly sized files
  • Use workload management to improve ETL runtimes
  • Perform table maintenance regularly
  • Perform multiple steps in a single transaction
  • Load data in bulk
  • Use UNLOAD to extract large result sets
  • Use Redshift Spectrum for ad hoc ETL processing
  • Monitor daily ETL health using diagnostic queries
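The first practice, loading from multiple evenly sized files, is usually paired with making the file count a multiple of the number of slices in the cluster so that every slice gets the same amount of work. The sketch below shows the splitting step only. The helper `split_evenly` is hypothetical, and it uses row counts as a stand-in for file size, which is an assumption of the example.

```python
def split_evenly(rows, slices, files_per_slice=1):
    """Split rows into slices * files_per_slice chunks.

    Chunk sizes differ by at most one row, so each slice loads a similar amount.
    """
    parts = slices * files_per_slice
    base, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional row each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


chunks = split_evenly(list(range(10)), slices=4)
print([len(c) for c in chunks])  # [3, 3, 2, 2]
```

Each chunk would then be written to its own object under a common S3 prefix and loaded with a single COPY command. Writing the files and issuing COPY are outside the scope of this sketch.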

How to load data to Redshift

  • Another table
  • S3
  • EMR
  • DynamoDB
  • Remote hosts over SSH

Database Security

  • By default, privileges are granted only to the object owner.
  • Amazon Redshift database users are named user accounts that can connect to a database. A user account is granted privileges explicitly, by having those privileges assigned directly to the account, or implicitly, by being a member of a group that is granted privileges.
  • Groups are collections of users that can be collectively assigned privileges for easier security maintenance.
  • Schemas are collections of database tables and other database objects. Schemas are similar to operating system directories, except that schemas cannot be nested. Users can be granted access to a single schema or to multiple schemas.
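The difference between explicit and implicit privileges can be modeled in a few lines. The following is a toy sketch under stated assumptions, not Redshift's actual implementation: the class `Catalog` and its fields are hypothetical, and superusers, schema-level privileges, and default privileges are deliberately ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy privilege model: owner, direct grants, and grants via groups."""
    owners: dict = field(default_factory=dict)        # object -> owning user
    user_grants: dict = field(default_factory=dict)   # user -> {(privilege, object)}
    group_grants: dict = field(default_factory=dict)  # group -> {(privilege, object)}
    memberships: dict = field(default_factory=dict)   # user -> {group, ...}

    def has_privilege(self, user, privilege, obj):
        # By default, only the object owner has privileges.
        if self.owners.get(obj) == user:
            return True
        # Explicit: granted directly to the user account.
        if (privilege, obj) in self.user_grants.get(user, set()):
            return True
        # Implicit: granted to a group the user belongs to.
        return any(
            (privilege, obj) in self.group_grants.get(group, set())
            for group in self.memberships.get(user, set())
        )


catalog = Catalog(
    owners={"sales": "alice"},
    user_grants={"bob": {("SELECT", "sales")}},
    group_grants={"analysts": {("SELECT", "sales")}},
    memberships={"carol": {"analysts"}},
)
print(catalog.has_privilege("alice", "INSERT", "sales"))  # True (owner)
print(catalog.has_privilege("bob", "SELECT", "sales"))    # True (explicit)
print(catalog.has_privilege("carol", "SELECT", "sales"))  # True (via group)
print(catalog.has_privilege("dave", "SELECT", "sales"))   # False
```

Granting to groups rather than to individual users keeps the grant list short, which is the maintenance benefit the notes above mention.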

Work in progress

Redshift Spectrum

  • Serverless
  • Queries data from S3

S3

  • Work in progress