AWS Certified Big Data - Specialty Study Notes

Last updated at 2019-03-24. Posted at 2019-03-19.

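Purpose

These notes summarize the exam scope and key points for the AWS Certified Big Data - Specialty exam.
The reference links point to English pages, but you can switch the language at the top right of each page.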

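Exam overview

Format: multiple-choice and multiple-response questions
Duration: 170 minutes
Languages: English, Japanese, Korean, Chinese (Simplified)
Exam fee: 30,000 yen (Japanese version, excluding tax)

The Japanese translation is often awkward, so taking the exam in English is recommended. (That was the case with the Professional exam.)
Even if you register for the Japanese version, you can switch between Japanese and English at the top of the page during the exam.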

Services covered

Athena
AWS Glue
AWS Data Pipeline
DynamoDB
EMR
Elasticsearch Service
Kinesis Data Analytics
Kinesis Data Firehose
Kinesis Data Streams
Kinesis Video Streams
ML
- Amazon Comprehend
- Amazon Rekognition
- Amazon Polly
- Amazon SageMaker
- Amazon Transcribe
Neptune
QuickSight
IoT
Redshift
S3

Glossary

HDFS：Hadoop Distributed File System
ML：Machine Learning
KMS：Key Management Service
CMK：Customer Master Key
ETL：Extract, Transform, Load
TCO：Total Cost of Ownership
SLA：Service Level Agreement

Athena

Query optimize

Partition the data in S3
Use Apache Parquet Format
Use ORC Format
Work in progress
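
Partitioning helps because Athena can skip whole S3 prefixes when a query filters on the partition columns. The sketch below builds Hive-style partitioned object keys (`year=/month=/day=`), which is one layout Athena recognizes. The function name, the prefix, and the file name are hypothetical and chosen only for illustration.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key such as
    logs/year=2019/month=03/day=19/part-0000.parquet.

    The year/month/day column names are an assumption; any partition
    columns that match the table definition would work the same way.
    """
    return (
        f"{prefix.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


if __name__ == "__main__":
    print(partitioned_key("logs/", date(2019, 3, 19), "part-0000.parquet"))
```

A query that filters on `year`, `month`, and `day` then only needs to read the objects under the matching prefix.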

AWS Glue

Work in progress

AWS Data Pipeline

Scheduled Pipeline
- Pipeline Components
- Instances
- Attempts
Data Nodes
- DynamoDBDataNode
- SqlDataNode
- RedshiftDataNode
- S3DataNode
Databases
Work in progress

DynamoDB

Capacity mode
- on-demand
- provisioned
Adaptive Capacity
Secondary Indexes Design
Work in progress
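
For provisioned mode, it helps to be able to estimate capacity units by hand. The sketch below assumes the standard sizing rules: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one write capacity unit covers one write per second of an item up to 1 KB. The function names are hypothetical, and transactional requests are not covered.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """Estimate the RCUs needed for a steady read workload."""
    # Item size is rounded up to the next 4 KB boundary.
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads consume half as much capacity.
        total = total / 2
    return math.ceil(total)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Estimate the WCUs needed for a steady write workload."""
    # Item size is rounded up to the next 1 KB boundary.
    return math.ceil(item_size_bytes / 1024) * writes_per_second


if __name__ == "__main__":
    print(read_capacity_units(6000, 10))         # strongly consistent reads
    print(read_capacity_units(6000, 10, False))  # eventually consistent reads
    print(write_capacity_units(1500, 5))
```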

Elasticsearch Service

An open-source text search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. It also integrates seamlessly with web applications.

Features

Encryption of data at rest (KMS)

Encrypted
- Indices
- Automated snapshots
- Elasticsearch logs
- Swap files
- All other data in the application directory
Not encrypted
- Manual snapshots
- Slow logs and error logs

EMR

Apache Hive

A data warehouse and analytics package that runs on top of a Hadoop cluster.
Hive scripts use an SQL-like language called HiveQL (Hive Query Language) that abstracts programming models and supports typical data warehouse interactions.
Hive lets you avoid writing MapReduce programs in a lower-level language such as Java.
Hive extends the SQL paradigm by including serialization formats. You can also customize query processing by creating a table schema that matches your data, without touching the data itself. In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

Apache Hadoop

Apache HBase

open source, non-relational, distributed database developed as part of the Hadoop project
runs on top of HDFS to provide non-relational database capabilities for the Hadoop ecosystem

Apache HCatalog

a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications
has a REST interface and command line client that allows you to create tables or do other operations

Apache Phoenix

Apache ZooKeeper

EMR Notebook

Work in progress

Kinesis Data Analytics

Windowed Queries

Tumbling window
- fixed-size, non-overlapping windows; each record belongs to exactly one window.
Sliding window
Stagger window
- suited for analyzing groups of data that arrive at inconsistent times. It is well suited for any time-series analytics use case, such as a set of related sales or log records.
Work in progress
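
Kinesis Data Analytics expresses windowed queries in SQL, but the semantics of a tumbling window (fixed-size, non-overlapping intervals) can be shown in a few lines of Python. This is only an illustration of the grouping logic, not how the service is implemented; the function name and the sample events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key).

    events: iterable of (epoch_seconds, key) pairs.
    Each event falls into exactly one window, because windows are
    aligned to multiples of window_seconds and never overlap.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0, "A"), (30, "A"), (59, "B"), (60, "A"), (125, "B")]
    for window, count in sorted(tumbling_window_counts(sample, 60).items()):
        print(window, count)
```

A sliding window would instead let one event contribute to several overlapping windows, and a stagger window opens when the first event for a key arrives rather than at a fixed boundary.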

Kinesis Data Firehose

Work in progress

Kinesis Data Streams

PutRecords Limits

Up to 500 records per request
Each record can be up to 1 MB, up to 5 MB for the entire request, including partition keys
Each shard supports writes of up to 1,000 records per second or 1 MB per second
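
A producer that calls PutRecords directly has to split its records into requests that stay within these limits. The sketch below is one way to do that, assuming a limit of 500 records and 5 MB per request and 1 MB per record. It treats 1 MB as 1,000,000 bytes and counts the partition key toward the record size, both of which are conservative assumptions. The function name is hypothetical, and no AWS call is made.

```python
MAX_RECORDS_PER_REQUEST = 500
MAX_RECORD_BYTES = 1_000_000       # assumed conservative reading of "1 MB"
MAX_REQUEST_BYTES = 5_000_000      # assumed conservative reading of "5 MB"


def batch_records(records):
    """Group (partition_key, data) pairs into PutRecords-sized batches.

    records: iterable of (str, bytes) pairs.
    Yields lists of pairs; each list respects the count and size limits.
    """
    batch, batch_bytes = [], 0
    for partition_key, data in records:
        size = len(partition_key.encode("utf-8")) + len(data)
        if size > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the per-record size limit")
        # Start a new batch when adding this record would break a limit.
        if batch and (len(batch) >= MAX_RECORDS_PER_REQUEST
                      or batch_bytes + size > MAX_REQUEST_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((partition_key, data))
        batch_bytes += size
    if batch:
        yield batch


if __name__ == "__main__":
    small = [(f"key-{i}", b"x" * 100) for i in range(1200)]
    print([len(b) for b in batch_records(small)])
```

Each yielded batch could then be passed to a PutRecords call. Failed records in a response still need to be retried individually, which this sketch does not handle.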

SSE (Server-Side Encryption)

AWS KMS CMK
A master key imported into the AWS KMS service
A user-specified AWS CMK

KPL Key Concepts

Records
Batching
Aggregation
Collection

KPL Key Concepts: Batching

Aggregation: storing multiple records within a single Kinesis Data Streams record
Collection: using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream

KCL

Work in progress

Kinesis Video Streams

Work in progress

Machine Learning (ML)

Types of ML Models

Binary Classification Model
- predicts a binary outcome (one of two possible classes)
Multiclass Classification Model
- generates predictions for multiple classes (predicts one of more than two outcomes)
Regression Model
- predicts a numeric value
JupyterHub
Work in progress

Neptune

A graph database service that powers the following use cases:

Recommendation engines
Fraud detection
Knowledge graphs
Drug discovery
Network security

QuickSight

Work in progress

IoT

AWS Certified Big Data - Specialty Study Notes - IoT

Redshift

Best Practices

High-Performance ETL Processing Practices

COPY data from multiple, evenly sized files
Use workload management to improve ETL runtimes
Perform table maintenance regularly
Perform multiple steps in a single transaction
Loading data in bulk
Use UNLOAD to extract large result sets
Use Redshift Spectrum for ad hoc ETL processing
Monitor daily ETL health using diagnostic queries
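
The first practice, loading from multiple evenly sized files, lets every slice in the cluster take a similar share of the COPY work. AWS guidance suggests making the number of files a multiple of the number of slices. The sketch below splits a list of rows into chunks whose sizes differ by at most one row. Row count is used as a rough proxy for file size, and the function name is hypothetical.

```python
def split_evenly(rows, num_files):
    """Split rows into num_files chunks whose lengths differ by at most one.

    num_files would typically be a multiple of the cluster's slice count
    (an assumption the caller has to supply).
    """
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    base, extra = divmod(len(rows), num_files)
    chunks, start = [], 0
    for index in range(num_files):
        # The first `extra` chunks take one additional row each.
        end = start + base + (1 if index < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    chunks = split_evenly(list(range(10)), 4)
    print([len(chunk) for chunk in chunks])
```

Each chunk would then be written to its own object under a common S3 prefix so that a single COPY command can load them in parallel.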

How to load data into Redshift

Another table
S3
EMR
DynamoDB
Remote hosts over SSH

Database Security

By default, privileges are granted only to the object owner.
Amazon Redshift database users are named user accounts that can connect to a database. A user account is granted privileges explicitly, by having those privileges assigned directly to the account, or implicitly, by being a member of a group that is granted privileges.
Groups are collections of users that can be collectively assigned privileges for easier security maintenance.
Schemas are collections of database tables and other database objects. Schemas are similar to operating system directories, except that schemas cannot be nested. Users can be granted access to a single schema or to multiple schemas.

Work in progress

Redshift Spectrum

Serverless
Queries data from S3

S3

Work in progress
