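Purpose
This post summarizes the exam scope and key points for the AWS Certified Big Data - Specialty exam.
The reference links point to English pages, but you can switch the language in the top-right corner.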
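Exam overview
- Format: multiple-choice and multiple-answer questions
- Duration: 170 minutes
- Languages: English, Japanese, Korean, Chinese (Simplified)
- Exam fee: 30,000 yen (Japanese version, excluding tax)
The questions are often translated into awkward Japanese, so I recommend taking the exam in English. (That was my experience with the Professional exam.)
Even if you register for the Japanese version, you can switch between Japanese and English at the top of the page during the exam.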
Services covered
- Athena
- AWS Glue
- AWS Data Pipeline
- DynamoDB
- EMR
- Elasticsearch Service
- Kinesis Data Analytics
- Kinesis Data Firehose
- Kinesis Data Streams
- Kinesis Video Streams
- ML
- Amazon Comprehend
- Amazon Rekognition
- Amazon Polly
- Amazon SageMaker
- Amazon Transcribe
- Neptune
- QuickSight
- IoT
- Redshift
- S3
Glossary
- HDFS: Hadoop Distributed File System
- ML: Machine Learning
- KMS: Key Management Service
- CMK: Customer Master Key
- ETL: Extract, Transform, Load
- TCO: Total Cost of Ownership
- SLA: Service Level Agreement
Athena
Query optimization
- Partition the data in S3
- Use Apache Parquet format
- Use ORC format
- Work in progress
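As a rough illustration of the partitioning advice above, the sketch below builds Hive-style partitioned S3 keys (`year=.../month=.../day=...`), a layout that lets Athena skip partitions excluded by a query's WHERE clause. The `logs` prefix, the function name, and the file name are hypothetical.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (hypothetical layout)."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


key = partitioned_key("logs", date(2019, 1, 5), "part-0000.parquet")
print(key)  # logs/year=2019/month=01/day=05/part-0000.parquet
```

With keys laid out this way, a query that filters on `year` and `month` only needs to scan the objects under the matching prefixes.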
AWS Glue
- Work in progress
AWS Data Pipeline
- Scheduled Pipeline
  - Pipeline Components
  - Instances
  - Attempts
- Data Nodes
  - DynamoDBDataNode
  - SqlDataNode
  - RedshiftDataNode
  - S3DataNode
- Databases
- Work in progress
DynamoDB
- Capacity mode
  - On-demand
  - Provisioned
- Adaptive Capacity
- Secondary Indexes Design
- Work in progress
Elasticsearch Service
An open-source text search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. It also integrates seamlessly with web applications.
Features
Encryption of data at rest (KMS)
- Encrypted
  - Indices
  - Automated snapshots
  - Elasticsearch logs
  - Swap files
  - All other data in the application directory
- Not encrypted
  - Manual snapshots
  - Slow logs and error logs
EMR
Apache Hive
- An open-source data warehouse and analytics package that runs on top of a Hadoop cluster.
- Hive scripts use a SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions.
- Hive lets you avoid the complexity of writing MapReduce programs in a lower-level language such as Java.
- Hive extends the SQL paradigm by including serialization formats. You can also customize query processing by creating table schema that match your data, without touching the data itself. In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.
Apache Hadoop
Apache HBase
- open source, non-relational, distributed database developed as part of the Hadoop project
- runs on top of HDFS to provide non-relational database capabilities for the Hadoop ecosystem
Apache HCatalog
- a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications
- has a REST interface and command line client that allows you to create tables or do other operations
Apache Phoenix
Apache ZooKeeper
EMR Notebook
Work in progress
Kinesis Data Analytics
Windowed Queries
- Tumbling window
- Sliding window
- Stagger window
  - Suited for analyzing groups of data that arrive at inconsistent times. It is well suited for any time-series analytics use case, such as a set of related sales or log records.
- Work in progress
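To make the window types a little more concrete, here is a minimal sketch of a tumbling window in plain Python: fixed-size, non-overlapping windows, with each record assigned to exactly one window by its timestamp. This is not Kinesis Data Analytics SQL; the function name and the sample events are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_counts(events: Iterable[Tuple[int, str]], size_s: int) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window of size_s seconds."""
    counts: Dict[int, int] = defaultdict(int)
    for ts, _payload in events:
        # Each timestamp maps to exactly one window start.
        window_start = ts - (ts % size_s)
        counts[window_start] += 1
    return dict(counts)


# (timestamp in seconds, payload) pairs; hypothetical sample data
events = [(0, "a"), (30, "b"), (59, "c"), (60, "d"), (125, "e")]
print(tumbling_counts(events, 60))  # {0: 3, 60: 1, 120: 1}
```

A sliding window would instead let one record contribute to several overlapping windows, and a stagger window would open when the first matching record arrives rather than on a fixed clock boundary.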
Kinesis Data Firehose
- Work in progress
Kinesis Data Streams
PutRecords Limit
- Up to 500 records per request
- Each record can be up to 1 MB, with a limit of 5 MB for the entire request, including partition keys
- Each shard supports writes of up to 1,000 records per second, up to a maximum of 1 MB per second
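These limits imply that a producer has to split its records into batches before calling PutRecords. The following is a minimal sketch of that splitting in plain Python, assuming records are `(partition_key, data)` pairs and treating 1 MB as 1,048,576 bytes. It does not call AWS, and the function and constant names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORDS = 500                    # records per PutRecords request
MAX_RECORD_BYTES = 1024 * 1024       # per-record data size (assumed 1 MiB)
MAX_REQUEST_BYTES = 5 * 1024 * 1024  # whole request, including partition keys

Record = Tuple[str, bytes]


def batch_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield lists of records that stay within the count and size limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for key, data in records:
        if len(data) > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the per-record size limit")
        size = len(key.encode("utf-8")) + len(data)
        full = len(batch) == MAX_RECORDS or batch_bytes + size > MAX_REQUEST_BYTES
        if batch and full:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((key, data))
        batch_bytes += size
    if batch:
        yield batch


records = [(f"key-{i}", b"x" * 100) for i in range(1200)]
print([len(b) for b in batch_records(records)])  # [500, 500, 200]
```

Each yielded batch could then be passed to a single PutRecords call. The per-shard write limit still applies separately, so a real producer would also need to handle throttled records.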
SSE (Server-Side Encryption)
- AWS KMS CMK
- A master key imported into the AWS KMS service
- A user-specified AWS KMS CMK
KPL Key Concepts
- Records
- Batching
- Aggregation
- Collection
KPL Key Concepts: Batching
- Aggregation: Storing multiple records within a single Kinesis Data Streams record
- Collection: Using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream
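The two kinds of batching can be pictured with a toy sketch. Aggregation is modeled here by packing several user records into one payload with a simple length prefix, and collection by grouping those payloads into request-sized lists. The real KPL uses its own aggregated record format, so the length-prefix encoding, the size limits, and all names below are assumptions for illustration only.

```python
import struct
from typing import List


def aggregate(user_records: List[bytes], max_bytes: int) -> List[bytes]:
    """Pack user records into as few stream-record payloads as possible."""
    payloads: List[bytes] = []
    current = b""
    for rec in user_records:
        framed = struct.pack(">I", len(rec)) + rec  # 4-byte length prefix (toy format)
        if current and len(current) + len(framed) > max_bytes:
            payloads.append(current)
            current = b""
        current += framed
    if current:
        payloads.append(current)
    return payloads


def collect(payloads: List[bytes], max_per_request: int) -> List[List[bytes]]:
    """Group stream-record payloads into lists sized for one PutRecords call."""
    return [
        payloads[i:i + max_per_request]
        for i in range(0, len(payloads), max_per_request)
    ]


user_records = [b"x" * 96] * 10  # ten user records, 100 bytes each once framed
payloads = aggregate(user_records, max_bytes=400)
requests = collect(payloads, max_per_request=2)
print(len(payloads), [len(r) for r in requests])  # 3 [2, 1]
```

Ten user records end up as three stream records sent in two requests, which is the point of combining both techniques.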
KCL
- Work in progress
Kinesis Video Streams
- Work in progress
Machine Learning(ML)
Types of ML Models
- Binary Classification Model
  - Predicts a binary outcome (one of two possible classes)
- Multiclass Classification Model
  - Generates predictions for multiple classes (predicts one of more than two outcomes)
- Regression Model
  - Predicts a numeric value
- JupyterHub
- Work in progress
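One way to read the three model types above is by the shape of the target column. The sketch below guesses the model type from a list of target values. The rule and the function name are a simplification for illustration, not how any AWS service makes this decision.

```python
from typing import Sequence, Union

Value = Union[int, float, str, bool]


def guess_model_type(targets: Sequence[Value]) -> str:
    """Guess the model type from the target values (simplified rule)."""
    distinct = set(targets)
    if len(distinct) == 2:
        return "binary classification"
    # Treat purely numeric targets (excluding booleans) as a regression problem.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        return "regression"
    return "multiclass classification"


print(guess_model_type(["spam", "ham", "spam"]))  # binary classification
print(guess_model_type(["cat", "dog", "bird"]))   # multiclass classification
print(guess_model_type([12.5, 3.0, 7.25]))        # regression
```

Note that a numeric target with only two distinct values is treated as binary here, which is one of the places where this rule is deliberately simplistic.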
Neptune
A graph database service that powers the following use cases:
- Recommendation engines
- Fraud detection
- Knowledge graphs
- Drug discovery
- Network security
QuickSight
- Work in progress
IoT
Redshift
Best Practices
High-Performance ETL Processing Practices
- COPY data from multiple, evenly sized files
- Use workload management to improve ETL runtimes
- Perform table maintenance regularly
- Perform multiple steps in a single transaction
- Load data in bulk
- Use UNLOAD to extract large result sets
- Use Redshift Spectrum for ad hoc ETL processing
- Monitor daily ETL health using diagnostic queries
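The first practice, COPY from multiple evenly sized files, can be sketched as a pre-processing step: split the rows into a number of parts that is a multiple of the number of slices in the cluster, so that each slice gets a similar amount of work. The slice count is passed in as a parameter here, and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def split_evenly(rows: Sequence[str], slices: int, files_per_slice: int = 1) -> List[List[str]]:
    """Split rows into slices * files_per_slice parts whose sizes differ by at most one."""
    parts = slices * files_per_slice
    base, extra = divmod(len(rows), parts)
    out: List[List[str]] = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first parts
        out.append(list(rows[start:start + size]))
        start += size
    return out


rows = [f"row-{i}" for i in range(10)]
print([len(p) for p in split_evenly(rows, slices=4)])  # [3, 3, 2, 2]
```

Each part would then be written as its own file under a common S3 prefix and loaded with a single COPY command.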
How to load data into Redshift
- Another table
- S3
- EMR
- DynamoDB
- Remote hosts (over SSH)
Database Security
- By default, privileges are granted only to the object owner.
- Amazon Redshift database users are named user accounts that can connect to a database. A user account is granted privileges explicitly, by having those privileges assigned directly to the account, or implicitly, by being a member of a group that is granted privileges.
- Groups are collections of users that can be collectively assigned privileges for easier security maintenance.
- Schemas are collections of database tables and other database objects. Schemas are similar to operating system directories, except that schemas cannot be nested. Users can be granted access to a single schema or to multiple schemas.
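The explicit and implicit grants described above can be modeled in a few lines: a user's effective privileges are the union of what is granted directly to the account and what is granted to each group the user belongs to. This is a toy in-memory model, not Redshift's catalog, and the user, group, and table names are made up.

```python
from typing import Dict, Set


def effective_privileges(
    user: str,
    user_grants: Dict[str, Set[str]],
    group_grants: Dict[str, Set[str]],
    memberships: Dict[str, Set[str]],
) -> Set[str]:
    """Return privileges granted explicitly plus those inherited from groups."""
    privileges = set(user_grants.get(user, set()))    # explicit grants
    for group in memberships.get(user, set()):
        privileges |= group_grants.get(group, set())  # implicit grants via group
    return privileges


user_grants = {"alice": {"SELECT on sales"}}
group_grants = {"etl": {"INSERT on staging", "SELECT on staging"}}
memberships = {"alice": {"etl"}, "bob": set()}

print(sorted(effective_privileges("alice", user_grants, group_grants, memberships)))
# ['INSERT on staging', 'SELECT on sales', 'SELECT on staging']
print(sorted(effective_privileges("bob", user_grants, group_grants, memberships)))
# []
```

Managing privileges through groups keeps the per-user grant list short, which is the maintenance benefit the notes mention.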
Work in progress
Redshift Spectrum
- Serverless
- Queries data from S3
S3
- Work in progress