Answer notes
1
- https://cloud.google.com/bigquery/docs/materialized-views-intro?hl=ja
- BigQuery materialized views are precomputed views that periodically cache query results to improve performance and efficiency
- BigQuery uses the precomputed materialized view results and, where possible, reads only the delta changes from the base tables to compute up-to-date results
- Queries that use materialized views are generally faster and consume fewer resources than queries that retrieve the same data from the base tables alone
- All incremental data changes in the base tables are automatically added to the materialized view
- Materialized views return fresh data
- When part of a query against a base table can be resolved by querying a materialized view, BigQuery reroutes the query to use the materialized view, which improves performance and efficiency
- Materialized views can optimize queries with high compute cost and small result sets
- Pre-aggregating data
- Queries that read only a specific subset of a table
- Query joins, especially joins between large and small tables
- Queries that would benefit from a clustering scheme different from the base table's
- Integration with other BigQuery features
- The query plan reflects the materialized views that were scanned (if any) and shows the number of bytes read from the materialized view and its base tables
- Results of queries rewritten to use materialized views can be cached, subject to the usual limitations (use of deterministic functions, no streaming into the base tables, and so on)
- If you set a limit on billed bytes and a query reads more bytes than that limit, the query fails without being billed, regardless of whether the query used the materialized view, the base tables, or both
- Estimating costs with a dry run
- Materialized views over BigLake metadata cache-enabled tables can reference structured data stored in Cloud Storage and Amazon Simple Storage Service (Amazon S3)
- When you create a materialized view over an Amazon S3 BigLake table, the materialized view data cannot be joined with BigQuery data
- Limitations
- You cannot directly update or manipulate materialized view data with operations such as COPY, EXPORT, LOAD, WRITE, or data manipulation language (DML) statements
- You cannot update the view SQL after the materialized view is created
- Materialized views cannot be nested on other materialized views
- Materialized views cannot query external tables, wildcard tables, logical views, snapshots, or tables with change data capture enabled
- You can set a description on a materialized view, but you cannot set descriptions on individual columns within it
- Materialized view pricing
- Querying materialized views
- Maintenance: bytes processed during refresh
- Storage: bytes stored in the materialized view
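Related to the "dry run" bullet above, a minimal sketch of estimating scan cost before running a query, using the google-cloud-bigquery client; the project/dataset/table names are placeholders:

```python
# Hypothetical sketch: estimating bytes scanned with a dry run before relying
# on materialized-view rewrites. Table and project names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
    SELECT store_id, SUM(amount) AS total
    FROM `my_project.sales.transactions`
    GROUP BY store_id
"""

# A dry run validates the query and reports the bytes it would process,
# without executing it or incurring query charges.
job = client.query(query, job_config=job_config)
print(f"Estimated bytes processed: {job.total_bytes_processed}")
```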
- https://cloud.google.com/bigquery/docs/materialized-views-create?hl=ja
- Query limitations
- Aggregates in a materialized view query must be outputs
- Computing, filtering, or joining on top of aggregated values is not supported
- Unsupported SQL features
- UNION ALL (supported in preview)
- LEFT OUTER JOIN (supported in preview)
- Incremental materialized views support LEFT OUTER JOIN and UNION ALL
- RIGHT/FULL OUTER JOIN
- Self-joins (using JOIN with the same table more than once)
- Window functions
- ARRAY subqueries
- Non-deterministic functions such as RAND(), CURRENT_DATE(), SESSION_USER(), and CURRENT_TIME()
- User-defined functions (UDFs)
- TABLESAMPLE
- FOR SYSTEM_TIME AS OF
- Materialized views over partitioned tables can be partitioned
- Partitioning a materialized view is similar to partitioning a regular table, in that it provides benefits when queries frequently access a subset of the partitions
- Partitioning a materialized view can improve the view's behavior when data in the base table or tables is changed or deleted
- Materialized views can be clustered by output columns, as long as the columns satisfy BigQuery's clustered table limitations
- Aggregate output columns cannot be used as clustering columns
- Adding clustering columns to a materialized view improves the performance of queries that filter on those columns
- A materialized view query can reference logical views, with the following limitations
- The materialized view limitations apply
- If a logical view changes, the materialized view is invalidated and must be fully refreshed
- Smart tuning is not supported
- Joins: the recommendations apply to materialized views that use JOINs
- Make sure the largest or most frequently changing table is the first/leftmost table referenced in the view query
- Avoid joining on clustering keys
- Materialized views with joins are most effective when data is frequently aggregated or when the original join query is expensive
- The max_staleness materialized view option provides consistently high performance with controlled costs when processing large, frequently changing datasets
- This behavior is useful for dashboards and reports where data freshness is not essential
- If the last refresh is within the max_staleness interval, BigQuery returns data directly from the materialized view without reading the base tables
- If the last refresh is outside the max_staleness interval, the query reads data from the base tables to return results within the staleness interval
- Non-incremental materialized views
- Non-incremental materialized views support most SQL queries, including OUTER JOIN, UNION, HAVING clauses, and analytic functions
- In scenarios where data staleness is acceptable, such as batch data processing or reporting, non-incremental materialized views can improve query performance and reduce costs
- To create a non-incremental materialized view, use the allow_non_incremental_definition option. This option must be used together with the max_staleness option
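A hedged sketch of creating a non-incremental materialized view with max_staleness via DDL, assuming the google-cloud-bigquery client and placeholder project/dataset/table names; the OPTIONS follow the documented syntax, but verify interval formats against the current docs:

```python
# Illustrative sketch only: creating a non-incremental materialized view with
# max_staleness through a DDL statement. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW `my_project.sales.daily_totals`
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60,
  max_staleness = INTERVAL "4" HOUR,
  allow_non_incremental_definition = true
)
AS
SELECT store_id, DATE(ts) AS day, SUM(amount) AS total
FROM `my_project.sales.transactions`
GROUP BY store_id, day
"""

client.query(ddl).result()  # wait for the DDL statement to complete
```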
2
3
- The Cloud SQL Auth Proxy is a Cloud SQL connector that provides secure access to your instances without requiring authorized networks or SSL configuration
- The Cloud SQL Auth Proxy automatically encrypts traffic to and from the database using TLS 1.3
- The Cloud SQL Auth Proxy uses IAM permissions to control who and what can connect to your Cloud SQL instances
- If required, the Cloud SQL Auth Proxy supports automatic refresh of OAuth 2.0 access tokens
- To connect to a Cloud SQL instance using private IP, the Cloud SQL Auth Proxy must be on a resource with access to the same VPC network as the instance
- The Cloud SQL Auth Proxy works by having a local client running in the local environment
- When an application connects to the Cloud SQL Auth Proxy, it checks whether an existing connection to the target Cloud SQL instance is available
- If no connection exists, it calls the Cloud SQL Admin API to obtain an ephemeral SSL certificate and uses it to connect to Cloud SQL
- The Cloud SQL Auth Proxy can listen on any port, but it creates outgoing (egress) connections to the Cloud SQL instance only on port 3307
- Because the Cloud SQL Auth Proxy calls the API using the domain name sqladmin.googleapis.com, which has no fixed IP address, you must allow all egress TCP connections on port 443
- The instance must have a public IPv4 address or be configured to use private IP
- Using the Cloud SQL Auth Proxy in a production environment
- Make sure the Cloud SQL Auth Proxy runs as a persistent service
- Number of copies of the Cloud SQL Auth Proxy that your application needs
- If a failover occurs while the Cloud SQL Auth Proxy is running against an instance configured for high availability, connections through the Cloud SQL Auth Proxy are affected the same way as connections over IP
- Keep the Cloud SQL Auth Proxy Docker image up to date
- You can use ConnectorEnforcement to require the use of the Cloud SQL Auth Proxy for Cloud SQL instance connections
- Some Google Cloud services and applications use the Cloud SQL Auth Proxy to provide encryption and authorization over the public IP path
- App Engine standard environment
- App Engine flexible environment
- Cloud Functions
- Cloud Run
- Applications running on Google Kubernetes Engine can connect using the Cloud SQL Auth Proxy
- When connecting from Google Kubernetes Engine, running the Cloud SQL Auth Proxy in the sidecar pattern (as an additional container sharing a Pod with the application) is recommended
- When you authenticate the Cloud SQL Auth Proxy, it uses a set of Google credentials to access Google Cloud on behalf of your application; this is separate from database user authentication
- Credentials supplied by the credential_file flag
- Credentials supplied by an access token
- Credentials supplied by an environment variable
- Credentials from an authenticated gcloud CLI client
- This method is particularly useful for getting a development environment up and running
- Credentials associated with the Compute Engine instance
- The environment's default service account
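A minimal application-side sketch, assuming the Cloud SQL Auth Proxy is already running locally and listening on 127.0.0.1:3306 for a MySQL instance; the user, password, and database names are placeholders, and PyMySQL is just one possible driver:

```python
# Minimal sketch (not the proxy itself): an application connecting through a
# locally running Cloud SQL Auth Proxy. Assumes the proxy was started
# separately and listens on 127.0.0.1:3306; credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",   # the proxy's local listener, not the instance IP
    port=3306,
    user="app_user",
    password="app_password",
    database="inventory",
)

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())

conn.close()
```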
4
5
- A schema is a namespace that contains database objects, such as tables, views, indexes, and functions
- You can also further segment and store rows in your database table across different geographic regions
- Data in Spanner is strongly typed
- The table schema defines one or more table columns as the table's primary key which uniquely identifies each row
- be careful not to create hotspots with the choice of your primary key
- techniques that can spread the load across multiple servers and avoid hotspots
- hash keys
- swap the order of keys
- v4 UUID
- Bit reverse sequential values
- Parent-child table relationships
- With interleaving, Spanner physically colocates child rows with parent rows in storage
- if you have a Customers table and an Invoices table, and your application frequently fetches all the invoices for a customer, you can define Invoices as an interleaved child table of Customers
- Foreign keys are a more general parent-child solution and address additional use cases
- However, a foreign key relationship does not imply co-location of the tables in the storage layer
- Google recommends that you choose to represent parent-child relationships either as interleaved tables or as foreign keys, but not both
- You can define hierarchies of interleaved parent-child relationships up to seven layers deep, which means that you can colocate rows of seven independent tables
- Spanner is a distributed database, which means that as your database grows, Spanner divides your data into chunks called "splits."
- if you follow best practices for schema design, Spanner can mitigate hotspots such that the read throughput should improve every few minutes until you saturate the resources in your instance or run into cases where no new split boundaries can be added
- Design for multi-tenancy
- Another way to design for multi-tenancy in Spanner is to have all customers in a single table in a single database, and to use a different primary key value for each customer
- There are limits on the number of databases per instance and the number of tables and indexes per database
- Adding new tables and non-interleaved indexes can take a long time
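A hedged sketch of the Customers/Invoices interleaving described above, using the google-cloud-spanner client with placeholder instance and database IDs; UUID-string keys follow the hotspot-avoidance advice in the notes:

```python
# Illustrative sketch: defining a parent-child relationship with interleaving.
# Instance and database IDs are placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

ddl = [
    """CREATE TABLE Customers (
         CustomerId STRING(36) NOT NULL,
         Name       STRING(MAX),
       ) PRIMARY KEY (CustomerId)""",
    # The child key starts with the full parent key; INTERLEAVE IN PARENT
    # physically colocates each invoice with its customer row.
    """CREATE TABLE Invoices (
         CustomerId STRING(36) NOT NULL,
         InvoiceId  STRING(36) NOT NULL,
         Amount     FLOAT64,
       ) PRIMARY KEY (CustomerId, InvoiceId),
         INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
]

operation = database.update_ddl(ddl)
operation.result()  # wait for the schema change to complete
```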
6
- General concepts
- components:
- Row keys
- Column families, including their garbage collection policies
- Columns
- In Bigtable, schema design is driven primarily by the queries, or read requests, that you plan to send to the table
- A secondary consideration is the avoidance of hotspots – to prevent hotspots, you need to consider write patterns and how you can avoid accessing a small key space in a short amount of time
- Bigtable is a key/value store, not a relational store. It does not support joins, and transactions are supported only within a single row
- Each table has only one index, the row key. There are no secondary indexes. Each row key must be unique
- Rows are sorted lexicographically by row key
- Column families are not stored in any specific order
- Columns are grouped by column family and sorted in lexicographic order within the column family
- The intersection of a row and column can contain multiple timestamped cells
- Aggregate column families contain aggregate cells
- All operations are atomic at the row level
- Ideally, both reads and writes should be distributed evenly
- Bigtable tables are sparse
- Best practices
- Tables
- Store datasets with similar schemas in the same table
- You can assign a unique row key prefix to use for each dataset, so that Bigtable stores the related data in a contiguous range of rows that you can then query by row key prefix
- avoid creating a large number of tables
- Column families
- Put related columns in the same column family
- Create up to about 100 column families per table
- Choose short names for your column families: Names are included in the data that is transferred for each request
- Put columns that have different data retention needs in different column families
- Garbage collection policies are set at the column family level, not at the column level
- Columns
- Create as many columns as you need in the table
- You can have millions of columns in a table, as long as no row exceeds the maximum limit of 256 MB per row
- Avoid using too many columns in any single row
- Even though a table can have millions of columns, a row shouldn't
- If your dataset logically requires more columns per row than Bigtable can process efficiently, consider storing the data as a protobuf in a single column
- Optionally, you can treat column qualifiers as data
- Rows
- Keep the size of all values in a single row under 100 MB
- Keep all information for an entity in a single row
- avoid storing data that you must read atomically, or all at once, in more than one row to avoid inconsistencies
- Store related entities in adjacent rows
- Cells
- Don't store more than 10 MB of data in a single cell
- Use aggregate cells to store and update aggregate data
- Row keys
- Design your row key based on the queries you will use to retrieve the data
- Row key
- Row key prefix
- Range of rows defined by starting and ending row keys
- Keep your row keys short
- Store multiple delimited values in each row key
- If your data includes integers that you want to store or sort numerically, pad the integers with leading zeroes
- Use human-readable string values in your row keys
- key visualizer tool
- you should design row keys that start with a common value and end with a granular value
- Row keys to avoid
- Row keys that start with a timestamp
- Row keys that cause related data to not be grouped
- Sequential numeric IDs
- Frequently updated identifiers
- e.g. device_id#memusage
- Hashed values
- impossible to store rows in a way that is optimal for querying
- Values expressed as raw bytes
- Special use cases
- Time based data
- Include a timestamp as part of your row key
- Don't use a timestamp by itself or at the beginning of a row key
- If you usually retrieve the most recent records first, you can use a reversed timestamp in the row key by subtracting the timestamp from your programming language's maximum value for long integers
- Multi-tenancy
- Using one table for all tenants is the most efficient way to store and access multi-tenant data
- Privacy
- avoid using personally identifiable information (PII) or user data in row keys or column family IDs
- Domain names
- Wide range of domain names
- Using a reverse domain name is an especially good idea if each row's data tends to overlap with adjacent rows
- Few domain names
- consider other values for your row key
- Changing or uncertain queries
- store all the data for a row in one column instead of multiple columns
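A small sketch of the row-key guidance above (delimited values, common value first and granular value last, zero-padded integers, and a reversed timestamp so recent rows sort first); the field names are illustrative only:

```python
# Row-key construction following the notes above. All names are illustrative.
import sys
import time

def metric_row_key(cluster: str, device_id: str, metric: str) -> str:
    # Common value first, granular value last: cluster#device#metric
    return f"{cluster}#{device_id}#{metric}"

def timeseries_row_key(device_id: str, timestamp_s: int) -> str:
    # Reversed timestamp (max long minus timestamp) puts newest rows first;
    # zero-padding keeps lexicographic order consistent with numeric order.
    reversed_ts = sys.maxsize - timestamp_s
    return f"{device_id}#{reversed_ts:020d}"

print(metric_row_key("us-central1-c", "device042", "memusage"))
print(timeseries_row_key("device042", int(time.time())))
```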
7
- Project-level custom quotas limit the aggregate usage of all users in that project
- User-level custom quotas are separately applied to all users and service accounts within a project
- You can set a custom quota or modify an existing custom quota for any quota displayed on the Quotas page of Google Cloud console
8
- Key Visualizer is a tool that helps you analyze your Bigtable usage patterns
- Iteratively designing a schema or improving the design of an existing schema
- Troubleshooting performance issues
- Getting a better understanding of how you access the data that you store in Bigtable
9
- Cloud VPN
- low cost
- IPsec
- 1.5~3.0 Gbps
- Verified Peering Provider
- Peering with Google
- ISP level network
- Cross Cloud Interconnect
- AWS, Azure
- PSC over interconnect
- <= 50 Gbps
- Dedicated Interconnect
- >= 10 Gbps
- Cloud Interconnect provides low-latency, high-availability connections that enable you to reliably transfer data between your Google Cloud Virtual Private Cloud (VPC) networks and your other networks
- Cloud Interconnect connections provide internal IP address communication
- don't need to use a NAT device or VPN tunnel to reach internal IP addresses
- doesn't traverse the public internet
- capacity
- 8 x 10-Gbps connections (80 Gbps total)
- 2 x 100-Gbps connections (200 Gbps total)
- Cloud Interconnect doesn't encrypt traffic by default
- You can use MACsec for Cloud Interconnect to help secure traffic
- Enabling online mode allows you to perform online transfers by streaming data directly to your Cloud Storage bucket after copying it to your appliance
- Quickly transfer data to Cloud Storage with low latency
- Connect to multiple appliances
- Other options
- Storage transfer service
- bigquery data transfer service
- transfer service for on-premises data
- area
- us
- eu
- uk
- singapore
- jp
- canada
- with Transfer Appliance, you can receive the appliance and capture 300 terabytes of data in under 25 days
- With a typical network bandwidth of 100 Mbps, 300 terabytes of data takes about 9 months to upload
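A back-of-the-envelope check of the "300 terabytes over 100 Mbps takes about 9 months" figure:

```python
# Transfer-time arithmetic for the figure quoted above.
data_bits = 300e12 * 8          # 300 terabytes expressed in bits
bandwidth_bps = 100e6           # 100 Mbps
seconds = data_bits / bandwidth_bps
days = seconds / 86400
print(f"{days:.0f} days (~{days / 30:.1f} months)")  # ≈ 278 days, roughly 9 months
```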
10
- You can scale a Dataproc cluster at any time, even when jobs are running on the cluster
- You cannot change the machine type of an existing cluster
- You can scale the number of primary workers or the number of secondary (preemptible) workers, or both
- Graceful Decommissioning
- finish work in progress on a worker before it is removed from the Cloud Dataproc cluster
- The preemptible (secondary) worker group continues to provision or delete workers to reach its expected size even after a cluster scaling operation is marked complete
- You can forcefully decommission preemptible workers at any time
- You can gracefully decommission primary workers at any time
- An Autoscaling Policy is a reusable configuration that describes how cluster workers using the autoscaling policy should scale
- cases
- on clusters that store data in external services, such as Cloud Storage or BigQuery
- on clusters that process many jobs
- to scale up single-job clusters
- with Enhanced Flexibility Mode for Spark Batch jobs
- not recommended with/for
- HDFS
- YARN Node Labels
- Spark Structured Streaming
- Idle Clusters
- instead, use workflows (e.g., Dataproc Workflow Templates) or Cloud Composer
- Different sized workloads
- Autoscaling configuration recommendations
- Avoid scaling primary workers
- Use Enhanced Flexibility Mode for Spark Batch jobs
- faster scale down
- prevent disruption
- minimize disruption
- With EFM enabled, an autoscaling policy's graceful decommissioning timeout must be set to 0s
- Choosing a graceful decommissioning timeout
- longer than the longest job that the cluster will process
- scaleUpFactor controls how aggressively the autoscaler scales up a cluster
- A good starting point is 0.05 for MapReduce jobs and Spark jobs with dynamic allocation enabled
- scaleDownFactor controls how aggressively the autoscaler scales down a cluster
- Leave this value at 1.0 for most multi-job clusters that need to scale up and down frequently
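A sketch of how the recommendations above fit together in an autoscaling policy, expressed as a plain Python dict mirroring the policy's YAML fields; the values are illustrative, and field names should be checked against the current Dataproc docs before importing:

```python
# Illustrative shape of an autoscaling policy (values and id are placeholders).
autoscaling_policy = {
    "id": "multi-job-policy",
    "basicAlgorithm": {
        "cooldownPeriod": "120s",
        "yarnConfig": {
            "scaleUpFactor": 0.05,                   # conservative starting point
            "scaleDownFactor": 1.0,                  # frequent up/down scaling on multi-job clusters
            "gracefulDecommissionTimeout": "3600s",  # longer than the longest expected job
        },
    },
    # Keep primary workers fixed; let secondary workers absorb the scaling.
    "workerConfig": {"minInstances": 2, "maxInstances": 2},
    "secondaryWorkerConfig": {"minInstances": 0, "maxInstances": 100},
}
print(autoscaling_policy)
```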
11
- Exports use database resources, but they do not interfere with normal database operations unless the instance is under-provisioned
- The following are best practices to consider when importing and exporting data:
- Don't use Cloud Storage Requester Pays buckets
- Compress data to reduce cost
- Reduce long-running import and export processes
- You can't stop a long-running Cloud SQL instance operation
- You can perform only one import or export operation at a time for each instance, and a long-running import or export blocks other operations, such as daily automated backups
- Use the bulk copy program utility for importing and exporting data
- Use bulk insert for importing data
- Use SqlPackage for importing and exporting data
- Use striped import and export
- When you perform a striped import or export, you reduce the time it takes for the operation to complete, and enable databases larger than 5 TB to be imported and exported
- Verify the imported database
- relational
- alloy db for postgresql
- heterogeneous migrations
- legacy applications
- enterprise workloads
- hybrid cloud, multicloud, and edge
- cloud sql
- mysql, postgresql, sql server
- crm, erp, ecommerce, web, saas
- spanner
- scale, global consistency, 99.99% availability
- 3B req / sec at peak
- gaming, retail, global financial ledger supply chain, inventory management
- bigquery
- serverless data warehouse, 99.99% availability
- multicloud analytics, realtime processing, built-in machine learning
- alloy db for postgresql
- key value
- bigtable
- highly performant fully managed nosql
- 99.999% availability
- 7B req / sec at peak
- personalization, adtech, recommendation engine, fraud detection
- bigtable
- document
- firestore
- highly scalable, 99.999% availability
- mobile/web/iot applications, realtime sync, offline sync
- firebase
- store and sync data in realtime
- mobile sign-ins, personalized app/ads, in-app chat
- firestore
- in-memory
- memorystore
- managed redis and memcached
- caching, gaming, leaderboard, social chat or news feed
- memorystore
- additional nosql
- mongodb atlas
- global cloud database
- mobile/web/iot apps, gaming, content management, single view
- mongodb atlas
- When a table size reaches 16 TB, the maximum size for Linux partitions, no additional data can be added
- Cloud SQL storage limits:
- Dedicated core: Up to 64 TB
- Shared core: Up to 3 TB
12
- With Connected Sheets, you can access, analyze, visualize, and share billions of rows of BigQuery data from your Google Sheets spreadsheet
- If you want to give a user Google Sheets access only, share a spreadsheet and don't grant BigQuery access
- If you have the required allow permissions and meet the VPC Service Controls access restrictions, you can configure the VPC Service Controls perimeter to allow queries issued through Connected Sheets
- Cloud Logging filter to identify queries issued through Connected Sheets:
  resource.type="bigquery_resource"
  protoPayload.metadata.firstPartyAppMetadata.sheetsMetadata.docId != NULL_VALUE
- BigQuery uses temporary tables to cache query results that aren't written to a permanent table
- After a query finishes, the temporary table exists for up to 24 hours
- You can use the Google Cloud console to download query results to a local file, Google Sheets, or Google Drive
13
- When designing your pipeline, consider the following questions:
- Where is your pipeline's input data stored? How many sets of input data do you have?
- What does your data look like?
- What do you want to do with your data?
- Where should your pipeline's output data go?
- Does your Dataflow job use Assured Workloads?
- To accelerate pipeline development, instead of building a pipeline by writing Apache Beam code, use a Dataflow template when possible
- Templates are reusable
- Templates let you customize each job by changing specific pipeline parameters
- Anyone you provide permissions to can use the template to deploy the pipeline
- You can use a Google-provided template, or you can create your own template
- Assured Workloads helps enforce security and compliance requirements for Google Cloud customers
- There is no Dataflow-specific cross pipeline communication mechanism for sharing data or processing context between pipelines
- You can use durable storage like Cloud Storage or an in-memory cache like App Engine to share data between pipeline instances
- Schedule jobs
- cloud scheduler
- airflow
- custom jobs on compute engine
- Best practices for writing pipeline code
- Structure your Apache Beam code
- To create pipelines, it's common to use the generic ParDo parallel processing Apache Beam transform
- When you apply a ParDo transform, you provide code in the form of a DoFn object
- we recommend creating pure functions, which are ideal for the parallel and distributed nature of DoFn elements
- When you use exactly-once processing, the Dataflow service guarantees that every element in your input PCollection is processed by a DoFn instance exactly once
- The Dataflow service doesn't guarantee how many times a DoFn is invoked
- The Dataflow service doesn't guarantee exactly how the distributed elements are grouped
- The Dataflow service doesn't guarantee the exact number of DoFn instances created over the course of a pipeline
- The Dataflow service is fault-tolerant and might retry your code multiple times if the workers encounter issues
- The Dataflow service might create backup copies of your code
- Issues might occur with manual side effects, such as if your code relies on or creates temporary files with non-unique names
- The Dataflow service serializes element processing per DoFn instance
- Create libraries of reusable transforms
- By creating a shared library of common transforms, you can improve reusability, testability, and code ownership by different teams
- Use dead-letter queues for error handling
- use a pattern called a dead-letter queue (unprocessed messages queue)
- Instead of dropping the failed element, use branching outputs to write failed elements into a separate PCollection object
- You can handle data that has unexpected but valid schemas by using a dead-letter pattern, which writes failed elements to a separate PCollection object
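A hedged sketch of the dead-letter pattern described above: failing elements are routed to a separate PCollection via a tagged output instead of failing the pipeline. The ParseEvent name and sample data are illustrative:

```python
# Dead-letter pattern with Apache Beam tagged outputs. Names are illustrative.
import json
import apache_beam as beam
from apache_beam import pvalue

class ParseEvent(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception:
            # Route the raw failing element to a side output tag.
            yield pvalue.TaggedOutput("dead_letter", element)

with beam.Pipeline() as p:
    results = (
        p
        | "Read" >> beam.Create(['{"id": 1}', "not-json"])
        | "Parse" >> beam.ParDo(ParseEvent()).with_outputs("dead_letter", main="parsed")
    )
    results.parsed | "HandleGood" >> beam.Map(print)
    results.dead_letter | "HandleBad" >> beam.Map(lambda e: print("dead letter:", e))
```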
- Decide how to join datasets
- You can use side inputs or the CoGroupByKey transform to perform joins in your pipeline
- Unlike PCollection objects, side inputs are mutable and can be determined at runtime
- side input sizes can be very large and might not fit into worker memory
- The CoGroupByKey transform is a core Apache Beam transform that merges (flattens) multiple PCollection objects and groups elements that have a common key
- CoGroupByKey performs a shuffle (grouping) operation to distribute data across workers
- Use side inputs when one of the PCollection objects you are joining is disproportionately smaller than the others, and the smaller PCollection object fits into worker memory
- Use side inputs when you have a PCollection object that must be joined multiple times in your pipeline
- Use CoGroupByKey if you need to fetch a large proportion of a PCollection object that significantly exceeds worker memory
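A sketch of the side-input guidance above: the small lookup PCollection is materialized as a dict side input while the large PCollection streams through. Dataset contents are made up for illustration:

```python
# Side-input join: small PCollection fits in worker memory as a dict.
import apache_beam as beam
from apache_beam.pvalue import AsDict

def enrich(order, users):
    user = users.get(order["user_id"], {})
    return {**order, "country": user.get("country")}

with beam.Pipeline() as p:
    users = p | "Users" >> beam.Create([("u1", {"country": "JP"}), ("u2", {"country": "US"})])
    orders = p | "Orders" >> beam.Create([{"user_id": "u1", "amount": 10}])

    enriched = orders | "Join" >> beam.Map(enrich, users=AsDict(users))
    enriched | beam.Map(print)
```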
- Minimize expensive per-element operations
- Because the DoFn.ProcessElement method is called for every element, any time-consuming or computationally expensive operations that are invoked by that method run for every single element processed by the method
- If you need to perform costly operations only once for a batch of elements, include those operations in the DoFn.Setup method or the DoFn.StartBundle method instead of in the DoFn.ProcessElement method
- Parsing a configuration file that controls some aspect of the DoFn instance's behavior
- Instantiating a short-lived client that is reused across all elements in a bundle, such as when all elements in the bundle are sent over a single network connection
- Limit batch sizes and concurrent calls to external services
- When you call external services, you can reduce per-call overheads by using the GroupIntoBatches transform
- limit the maximum number of parallel (concurrent) calls to the external service by choosing appropriate keys to partition the incoming data
- Choose an attribute of the dataset to use as data keys, such as user IDs
- Generate data keys to split elements randomly over a fixed number of partitions, where the number of possible key values determines the number of partitions
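A sketch of batching external-service calls with GroupIntoBatches, keyed so the number of distinct keys bounds the number of concurrent batches; call_external_service is a placeholder for the real client call:

```python
# Batching calls to an external service with GroupIntoBatches.
import random
import apache_beam as beam

NUM_PARTITIONS = 10  # bounds parallel calls to the external service

def call_external_service(batch):
    # Placeholder: send one request for the whole batch.
    return [f"processed:{item}" for item in batch]

with beam.Pipeline() as p:
    (
        p
        | beam.Create([f"element-{i}" for i in range(100)])
        | "Key" >> beam.Map(lambda e: (random.randrange(NUM_PARTITIONS), e))
        | "Batch" >> beam.GroupIntoBatches(batch_size=20)
        | "Call" >> beam.FlatMap(lambda kv: call_external_service(kv[1]))
        | beam.Map(print)
    )
```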
- Identify performance issues caused by fused steps
- Dataflow might fuse some operations together, a process known as fusion optimization, to avoid the performance and cost impact of writing every intermediate PCollection object in your pipeline
- you can prevent operations from being fused
- One method to unfuse steps is to insert a GroupByKey transform and ungroup before the step
- KV objects with duplicate keys are deduplicated to a single key in the group (GroupByKey) transform and the ungroup (Keys) transform
- Use Apache Beam metrics to collect pipeline insights
14
- Unbounded PCollections, or unbounded collections, represent data in streaming pipelines
- You can use windows, watermarks, and triggers to aggregate elements in unbounded collections
- Tumbling windows (called fixed windows in Apache Beam)
- A tumbling window represents a consistent, disjoint time interval in the data stream
- Hopping windows (called sliding windows in Apache Beam)
- Hopping windows can overlap
- Session windows
- A session window contains elements within a gap duration of another element
- A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived
- If the watermark has progressed past the end of the window and new data arrives with a timestamp within the window, the data is considered late data
- Dataflow tracks watermarks because of the following reasons:
- Data is not guaranteed to arrive in time order or at predictable intervals
- Data events are not guaranteed to appear in pipelines in the same order that they were generated
- Triggers
- By default, results are emitted when the watermark passes the end of the window
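A sketch of windowing with a watermark trigger in Beam: 60-second tumbling windows, a trigger that fires when the watermark passes the window end, and an illustrative allowed-lateness value. Timestamps are attached manually here; a real streaming source would supply them:

```python
# Fixed (tumbling) windows with a watermark-based trigger.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("user1", 1), ("user1", 2), ("user2", 5)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 10))  # attach event timestamps
        | beam.WindowInto(
            window.FixedWindows(60),                   # 60-second tumbling windows
            trigger=AfterWatermark(),                  # emit when the watermark passes the window end
            accumulation_mode=AccumulationMode.DISCARDING,
            allowed_lateness=120,                      # accept data up to 2 minutes late
        )
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```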
15
- Bigtable is ideal for applications that need high throughput and scalability for key-value data, where each value is typically no larger than 10 MB
- Time-series data, such as CPU and memory usage over time for multiple servers
- Marketing data, such as purchase histories and customer preferences
- Financial data, such as transaction histories, stock prices, and currency exchange rates
- Internet of Things data, such as usage reports from energy meters and home appliances
- Graph data, such as information about how users are connected to one another
- The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row
- Each row is indexed by a single row key, and columns that are related to one another are typically grouped into a column family
- Each intersection of a row and column can contain multiple cells
- Each cell contains a unique timestamped version of the data for that row and column
- A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries
- Tablets are stored on Colossus, Google's file system, in SSTable format
- An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings
- Importantly, data is never stored in Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus
- Rebalancing tablets from one node to another happens quickly, because the actual data is not copied
- Recovery from the failure of a Bigtable node is fast, because only metadata must be migrated to the replacement node
- Each Bigtable zone is managed by a primary process, which balances workload and data volume within clusters
- This process splits busier or larger tablets in half and merges less-accessed/smaller tablets together, redistributing them between nodes as needed
- To get the best write performance from Bigtable, it's important to distribute writes as evenly as possible across nodes
- One way to achieve this goal is by using row keys that don't follow a predictable order
- it's useful to group related rows so they are next to one another, which makes it much more efficient to read several rows at the same time
- Bigtable compresses your data automatically using an intelligent algorithm
- Random data cannot be compressed as efficiently as patterned data
- Compression works best if identical values are near each other
- Bigtable compresses values that are up to 1 MiB in size
- Single-cluster Bigtable instances provide strong consistency
- By default, instances that have more than one cluster provide eventual consistency, but for some use cases they can be configured to provide read-your-writes consistency or strong consistency, depending on the workload and app profile settings
- You can manage security at the project, instance, table, or authorized view levels
- Bigtable does not support row-level, column-level, or cell-level security restrictions
- By default, all data stored within Google Cloud, including the data in Bigtable tables, is encrypted at rest using the same hardened key management systems that we use for our own encrypted data
- Bigtable provides change data capture (CDC) in the form of change streams
- Change streams let you capture and stream out data changes to a table as the changes happen
- You can read a change stream using a service such as Dataflow to support use cases including data analytics, audits, archiving requirements, and triggering downstream application logic
16
- You can use Dataproc to run most of your Hadoop jobs on Google Cloud
- Update your job to point to your persistent data stored in Cloud Storage
- Create a Dataproc cluster on which to run your job
- Submit your job to the ephemeral cluster
- Optionally, monitor your job logs using Cloud Logging or Cloud Storage
- Check your job's output on Cloud Storage
- When your job completes, delete the cluster
- Dataproc runs Hadoop, so many kinds of jobs are supported automatically
- Hadoop, Spark, Hive, Pig
- Dataproc provides a mechanism called initialization actions, which enables you to customize the software running on the nodes of your cluster
- Cloud Storage connector, which is preinstalled on Dataproc cluster nodes, enables your jobs to use Cloud Storage as a Hadoop compatible file system (HCFS)
- you create ephemeral clusters when you need them and delete them when your jobs are finished
- It can be difficult to calculate the perfect cluster configuration, because each job has its particular needs and idiosyncrasies
- How many nodes to use
- The type of virtual machine to use for your primary (master) node
- The type of virtual machine to use for your worker nodes
- Use the default value of disk and evaluate the results unless you know that your jobs have unusual demands on primary disk usage
- For particularly disk-intensive jobs, especially those with many individual read and write operations, you might be able to improve operation by adding local SSDs
- Consider the inherent unreliability of preemptible nodes before choosing to use them
- in general, the more preemptible nodes you use relative to standard nodes, the higher the chances are that the job won't have enough nodes to complete the task
- Direct data access: Store your data in Cloud Storage and access it directly
- HDFS compatibility: You can easily access data in Cloud Storage using the gs:// prefix
- Interoperability: seamless interoperability between Spark, Hadoop, and Google services
- Data accessibility: When you shut down a Hadoop cluster, you continue to have access to your data in Cloud Storage
- High data availability: Data stored in Cloud Storage is highly available and globally replicated
- No storage management overhead: Cloud Storage requires no routine maintenance
- Quick startup: With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time
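A hedged sketch of submitting a PySpark job to an existing Dataproc cluster where the script and its input live in Cloud Storage (gs://); project, region, cluster, and bucket names are placeholders:

```python
# Submitting a PySpark job that reads/writes gs:// paths. Names are placeholders.
from google.cloud import dataproc_v1

project_id, region, cluster = "my-project", "us-central1", "ephemeral-cluster"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/wordcount.py",
        "args": ["gs://my-bucket/input/", "gs://my-bucket/output/"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # wait for the job to finish
print(response.driver_output_resource_uri)
```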
17
- Compute
- Auto Scaling
- it makes sense to have more aggressive upscale configurations for clusters running business-critical applications/jobs, while clusters running low-priority jobs can use less aggressive ones
- Dataproc auto scaling depends on YARN metrics - Allocated, Available and Pending memory
- Scaling down can be less straightforward than scaling up and can result in task reprocessing or job failures
- Preferably scale only the secondary workers (the ones without data nodes)
- This eliminates the need to move HDFS data off the nodes being deleted
- Graceful decommission should ideally be set to be longer than the longest running job on the cluster
- Use Enhanced Flexibility Mode (EFM) to enable the use of more aggressive auto scaling policies
- Preemptible VMs
- Use PVMs only for secondary workers as they do not run HDFS
- Keep PVMs to less than 30% of the max secondary workers
- Use PVMs only for fault-tolerant jobs and test rigorously in lower-level environments before promoting to production
- Increase Application (MapReduce/Spark/etc) fault tolerance by increasing maximum attempts of application master and task/executor as required
- Storage
- GCS, HDFS, PD, Zonal disks, Local SSD
- Performance considerations
- same region
- disable auto.purge
- consider tuning the Spark SQL/RDD/DataFrame parameters to reduce the possibility of many files with varying/small sizes being written to GCS
- Enhanced Flexibility Mode
- Removal of worker nodes due to downscaling or preemption often results in the loss of shuffle (intermediate) data stored locally on the node
- To minimize job delays in such scenarios, it is highly recommended to enable Enhanced Flexibility Mode on the cluster
- Primary-worker shuffle: Recommended for Spark jobs, this enables mappers to write data to primary workers
- HCFS (Hadoop Compatible File System) shuffle: Mappers write data to an HCFS implementation (HDFS by default)
- Operations
- Dataproc Labels
- Billing: It is possible to track and consolidate costs associated with Dataproc clusters for the purpose of attribution to users, teams or departments
- Searching and Categorizing
- Dynamically submit job/workflows to Dataproc cluster pools based on cluster or job labels
- Monitoring
- You can perform rolling upgrades of clusters using labels
- Cluster pools
- pools can be assigned Dataproc Workflow Templates
- This enables you to submit jobs using Workflow Templates to cluster pools
- Cluster scheduled deletion
18
- The training set doesn't adequately represent real-life data (or the validation set or test set)
- The model is too complex
19
- Sensitive Data Protection uses information types—or infoTypes—to define what it scans for
- An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, credit card number, and so on
- Every infoType defined in Sensitive Data Protection has a corresponding detector
- When you set up Sensitive Data Protection to scan your content, you include the infoType detectors to use in the scan configuration
- If you don't specify any infoTypes, Sensitive Data Protection uses a default infoTypes list
- Built-in infoType detectors are built into Sensitive Data Protection
- They include detectors for country- or region-specific sensitive data types as well as globally applicable data types
- Custom infoType detectors are detectors that you create yourself
- Regular custom dictionary detectors are simple word lists that Sensitive Data Protection matches on
- Stored custom dictionary detectors are generated by Sensitive Data Protection using large lists of words or phrases stored in either Cloud Storage or BigQuery
- Regular expressions (regex) detectors enable Sensitive Data Protection to detect matches based on a regular expression pattern
- Sensitive Data Protection includes the concept of inspection rules, which enable you to fine-tune scan results using the following
- Exclusion rules enable you to decrease the number of findings returned by adding rules to a built-in or custom infoType detector
- Hotword rules enable you to increase the quantity or change the likelihood value of findings returned by adding rules to a built-in or custom infoType detector
- Built-in infoType detectors are not a perfectly accurate detection method
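A hedged sketch of inspecting free text with built-in infoType detectors using the google-cloud-dlp client; the project ID, infoType choices, and sample text are placeholders:

```python
# Inspecting content with built-in infoType detectors. Names are placeholders.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"

inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
}
item = {"value": "Contact me at jane.doe@example.com or 555-0100."}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.quote, finding.likelihood)
```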
20
- The pull mode can use one of the two service APIs, Pull or StreamingPull
- the clients rely on the Google Cloud-provided high-level client library that performs streaming pull requests internally and delivers messages asynchronously
- For a single streaming pull request, a subscriber client can have multiple responses returned due to the open connection
- In contrast, only one response is returned for each pull request
- Where possible, the Pub/Sub client libraries use StreamingPull for maximum throughput and lowest latency
- The client sends a request to the server to establish a connection
- If there is no error or the connection quota is available again, the server continuously sends messages to the connected client
- If or when the throughput quota is exceeded, the server stops sending messages
- The client or the server eventually closes the connection
- This API is a traditional unary RPC that is based on a request and response model
- A single pull response corresponds to a single pull request
- The client sends a request to the server for messages. If the throughput quota is exceeded, the server returns a resource exhausted error
- If there is no error or the throughput quota is available again, the server replies with zero or more messages and acknowledgment IDs
- When using the unary Pull API, a response with zero messages or with an error does not necessarily indicate that there are no messages available to receive
- In push delivery, Pub/Sub initiates requests to your subscriber application to deliver messages
- Messages are delivered to a publicly addressable server or a webhook, such as an HTTPS POST request
- Push subscriptions minimize dependencies on Pub/Sub-specific client libraries and authentication mechanisms
- The Pub/Sub server sends each message as an HTTPS request to the subscriber client at a pre-configured endpoint
- The endpoint acknowledges the message by returning an HTTP success status code
- Pub/Sub dynamically adjusts the rate of push requests based on the rate at which it receives success responses
- message
- Wrapped: Pub/Sub sends the message in the JSON body of a POST request
- Unwrapped: Pub/Sub sends the raw message data directly as the HTTP body
- If a push subscription uses authentication, the Pub/Sub service signs a JWT and sends the JWT in the authorization header of the push request
- To temporarily stop Pub/Sub from sending requests to the push endpoint, change the subscription to pull
- The changeover can take several minutes to take effect
- Push backoff
- If a push subscriber sends too many negative acknowledgments, Pub/Sub might start delivering messages using a push backoff
- When Pub/Sub uses a push backoff, it stops delivering messages for a predetermined amount of time
- Delivery rate
- Pub/Sub adjusts the number of concurrent push requests using a slow-start algorithm
- The maximum allowed number of concurrent push requests is the push window
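A sketch of asynchronous (streaming) pull using the high-level client library, which manages StreamingPull connections internally; the project and subscription names are placeholders:

```python
# Asynchronous pull with the high-level Pub/Sub client library.
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received: {message.data!r}")
    message.ack()  # acknowledge so Pub/Sub doesn't redeliver

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # block for 30 s, then stop
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```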
21
- pull
- use case
- Large volume of messages (GBs per second)
- Efficiency and throughput of message processing is critical
- Environments where a public HTTPS endpoint with a non-self-signed SSL certificate is not feasible to set up
- endpoints
- Any device on the internet that has authorized credentials is able to call the Pub/Sub API
- load balancing
- Multiple subscribers can make pull calls to the same "shared" subscription
- Each subscriber receives a subset of messages
- flow control
- The subscriber client controls the rate of delivery
- The subscriber can dynamically modify the acknowledgment deadline, allowing message processing to be arbitrarily long
- efficiency and throughput
- Achieves high throughput at low CPU and bandwidth by allowing batched delivery, acknowledgments, and massively parallel consumption
- May be inefficient if aggressive polling is used to minimize message delivery time
- push
- use case
- Multiple topics that must be processed by the same webhook
- App Engine Standard and Cloud Run functions subscribers
- Environments where Google Cloud dependencies (such as credentials and the client library) are not feasible to set up
- endpoint
- An HTTPS server with non-self-signed certificate accessible on the public web
- The receiving endpoint might be decoupled from the Pub/Sub subscription, so that messages from multiple subscriptions are sent to a single endpoint
- Endpoints must be reachable using DNS names and have SSL certificates installed
- load balancing
- Push endpoints can be load balancers
- flow control
- The Pub/Sub server automatically implements flow control
- There's no need to handle message flow at the client side
- However, it's possible to indicate that the client cannot handle the current message load by passing back an HTTP error
- efficiency and throughput
- Delivers one message per request and limits the maximum number of outstanding messages
- When to use an export subscription
- The overhead of running a Dataflow job is not necessary when messages don't require additional processing before being stored
- Simple deployment
- Low costs
- Minimal monitoring
- A BigQuery subscription can use the schema of the topic to which it is attached, which is not available with the basic Dataflow template for writing from Pub/Sub to BigQuery
- Similarly, a Cloud Storage subscription offers configurable file batching options based on file size and elapsed time, which are not configurable in the basic Dataflow template for writing from Pub/Sub to Cloud Storage
22
23
You are building a tool to help the finance team process expense claims more quickly. Employees submit expense claims with minimal descriptions and attached receipt images. How can you capture the receipt details for auditing and analysis?
- Create document processors that help automate tedious tasks, improve data extraction, and gain deeper insights from unstructured or structured document information
- Digitize documents using OCR to get text, layout, and various add-ons such as image quality detection and deskewing
- Extract text and layout information, from document files and normalize entities
- Identify key-value pairs (kvp) in structured forms and simple tables
- Classify document types to drive downstream processes such as extraction and storage
- Split and classify documents by type
- Prepare datasets to be used in fine-tuning and model evaluations using auto-labeling, schema management, and dataset management features
- Extract text and values from expense documents such as expense date, supplier name, total amount, and currency
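A hedged sketch of sending a receipt image to a Document AI expense processor and reading the extracted entities; the project, location, processor ID, and file name are placeholders, and the processor must already exist:

```python
# Processing a receipt with a Document AI expense processor. IDs are placeholders.
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
processor_name = client.processor_path("my-project", "us", "my-expense-processor-id")

with open("receipt.jpg", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="image/jpeg")

request = documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
result = client.process_document(request=request)

for entity in result.document.entities:
    # e.g. supplier name, total amount, currency, receipt date
    print(entity.type_, entity.mention_text, entity.confidence)
```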
24
25
Your company runs Cloud SQL in two regions. The first region, us-central1, is close to end users, and its production usage is heavy and predictable. The other region, europe-west1, is close to the development team, and its usage is intermittent. How can you reduce costs without compromising on effort, latency, or performance?
- Committed use discounts
- Cloud SQL CUDs give you a 25% discount off on-demand pricing for a one-year commitment and a 52% discount off on-demand pricing for a three-year commitment
- Cloud SQL CUDs apply only to vCPUs and memory
- Purchase CUDs for regions
- Determine the consistent baseline vCPU/RAM usage per project
- If you have instances that are run only for bursts or shorter durations, exclude them from your calculations, and pay on-demand rates for those instances
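Quick arithmetic for the CUD figures above (the discount applies only to vCPUs and memory); the monthly baseline is an illustrative number:

```python
# CUD savings arithmetic with an illustrative baseline cost.
on_demand_monthly = 1000.0            # illustrative monthly vCPU + RAM cost (USD)
one_year = on_demand_monthly * (1 - 0.25)
three_year = on_demand_monthly * (1 - 0.52)
print(f"1-year CUD: ${one_year:.2f}/month, 3-year CUD: ${three_year:.2f}/month")
```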
- What are the main sources of waste in cloud databases?
- Over-provisioned resources
- Idle resources
- Discounts not leveraged
Past articles