Migrate table from BigQuery to ADW

Posted at 2019-11-06

Before we start

In BigQuery, assume we have a project called "bq2adw" that contains a dataset called "bqds02", which in turn contains a table called "T4_1". What we are going to do is move this table into Oracle Autonomous Data Warehouse.

Obviously, we cannot access BigQuery from ADW (Oracle Autonomous Data Warehouse) directly, so the first thing we have to do is export the tables from BigQuery to somewhere else. According to the official documentation, GCS (Google Cloud Storage) is the only destination BigQuery can export tables to, so we don't have any other choice here.

After we export the tables to GCS, the easiest way forward would be to load the exported files (AVRO or CSV) into ADW directly, just as we can do with Amazon S3 or Azure Blob Storage. Unfortunately, as of this writing, Google Cloud Storage is not on the supported list yet, so we have to move the exported files from GCS to somewhere ADW can access directly. No doubt, Oracle Object Storage is the first choice.

That means we need to move the files from GCS to OOS (Oracle Object Storage). Essentially, both are object storage services, so both provide command-line tools and SDKs/libraries to interact with them. Next, we will use Python to do all the work from beginning to end, which includes the following steps.

  • Export tables from BigQuery to Google Cloud Storage
  • Download the exported files from Google Cloud Storage to the intermediate server
  • Upload the files from the intermediate server to Oracle Object Storage
  • Load the data in the files from Oracle Object Storage into Oracle Autonomous Data Warehouse

The workflow is shown below.

image.png

Setting up authentication to Google Cloud Platform

Create service account & generate key file

Visit the Google Cloud Platform console page and go to [IAM & admin] -> [Service accounts] to create a service account.

image.png

Click the [CREATE SERVICE ACCOUNT] button, and input the necessary information.

image.png

image.png

Select a role for this service account. For simplicity, we select [Owner] here. Then click [Continue].

image.png

Input your Google account

image.png

If you did not create a key in the last step, you can click the action icon and then [Create key] to generate one.

image.png

Select JSON format and click [CREATE]

image.png

A private key is generated, and you are responsible for storing it safely, just like any other private key.

image.png

Now the GCP service account has been created, and the corresponding key file has been generated. We will configure and use it on the intermediate server later.

Prepare authentication in the intermediate server

Transfer the key file to the intermediate server, or download it from the GCP IAM & admin page.
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of your key file. The Python application will refer to this variable to authenticate and access the GCP services.

export GOOGLE_APPLICATION_CREDENTIALS="[PATH TO THE KEY FILE]"

image.png

Now we should be able to access and manipulate the services in the GCP project.
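
As a side note, if you prefer not to rely on the environment variable, the Google client libraries can also be pointed at the key file explicitly. A minimal sketch (the key file path is a placeholder):

from google.cloud import bigquery
from google.cloud import storage

# Load the service account key explicitly instead of relying on
# GOOGLE_APPLICATION_CREDENTIALS (the path below is a placeholder).
key_path = "/path/to/your-key-file.json"
bq_client = bigquery.Client.from_service_account_json(key_path)
gcs_client = storage.Client.from_service_account_json(key_path)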

Export BigQuery tables

Install BigQuery Client Libraries

As mentioned above, we will use Python scripts to control the work programmatically, so we need to install the BigQuery client library to interact with the BigQuery service. On the intermediate server, we already have a working Python 3.6.8 environment. Let's install the library first.

python -m pip install --upgrade google-cloud-bigquery

After installing the BigQuery library, let's write a small Python script to test whether it works.

from google.cloud import bigquery

client = bigquery.Client()

query = (
    "select keyid from `bqds02.T4_1` limit 10"
)
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

for row in query_job:  # API request - fetches results
    # Row values can be accessed by field name or index
    assert row[0] == row.keyid == row["keyid"]
    print(row)

We should see a result like the one below, which means the library and the authentication are working as expected.

image.png

Export BigQuery tables

Now we can start to interact with the BigQuery service. For example... export a table!

image.png

Below is a sample script that exports table T4_1 to Google Cloud Storage and names the exported file "T4_1_export.avro".

from google.cloud import bigquery

client = bigquery.Client()

# set the basic export configuration
bucket_name = 'bucket-bq2adw'
project = "bq2adw"
dataset_id = "bqds02"
table_id = "T4_1"

destination_uri = "gs://{}/{}".format(bucket_name, "T4_1_export.avro")
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

# Specify the export file format: AVRO
job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = (bigquery.job.DestinationFormat.AVRO)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
    job_config=job_config
)  # API request
extract_job.result()  # Waits for job to complete.

print(
    "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
)

In the terminal, we should see the output of this sample script, as shown below.

image.png

In the GCS bucket, we can also see that the exported file is there.

image.png
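
One caveat worth mentioning: BigQuery can export at most 1 GB of table data to a single file, so for larger tables the destination URI needs a wildcard, and the export is sharded into multiple files. A minimal adjustment to the script above:

# For tables larger than 1 GB, include a wildcard so BigQuery can shard the
# export into multiple files (T4_1_export_000000000000.avro, ...).
destination_uri = "gs://{}/{}".format(bucket_name, "T4_1_export_*.avro")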

So far, we are able to access BigQuery and export a specific table into a Google Cloud Storage bucket. Next, we will download the exported file to the intermediate server.

Download the exported table file

In this step, we will still use a Python script to do the work, so to interact with the Google Cloud Storage service, we need to install its client library first.

Install Google Cloud Storage Client Libraries

The installation is very easy; running the command below should be fine.

python -m pip install --upgrade google-cloud-storage

Using the sample script below, we can download the exported AVRO file of table T4_1 to the intermediate server.

from google.cloud import storage

client = storage.Client()


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # storage_client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    blob.download_to_filename(destination_file_name)

    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))


if __name__ == "__main__":
    download_blob('bucket-bq2adw', 'T4_1_export.avro', 'T4_1_export.avro')

We can see that the AVRO file is downloaded successfully.

image.png

Upload the files to Oracle Object Storage

At this point, the interactions with Google Cloud Platform are complete. Next, we need to interact with Oracle Cloud services, namely Object Storage and Autonomous Data Warehouse. This feels like a mirror image of the previous steps. As with any cloud service, the first thing to do is set up authentication.

Setting up authentication

In this step, we will use the OCI (Oracle Cloud Infrastructure) Python SDK. Although we don't use the OCI CLI for now, the easiest way to set up OCI authentication is to follow the OCI Quickstart tutorial. By following it, you can easily install the OCI CLI and set up the authentication needed to access the Oracle Object Storage service.
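
For reference, the resulting configuration file ($HOME/.oci/config) typically looks like the following (all values below are placeholders):

[DEFAULT]
user=ocid1.user.oc1..<unique_id>
fingerprint=<api_key_fingerprint>
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..<unique_id>
region=us-ashburn-1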

Once the CLI is installed and the authentication configuration is completed, we can run the following command to verify that it works.

# get the namespace of the object storage with the authentication you have set
oci os ns get

If you get a response like the one below, the OCI authentication setup is complete.

image.png

Install OCI Python SDK

It is super simple! The same as installing any other Python library.

python -m pip install --upgrade oci

If you execute the following sample script, you should get the same namespace as the response from the OCI CLI command oci os ns get.

import oci

# use default config file: $HOME/.oci/config
config = oci.config.from_file()

client = oci.object_storage.ObjectStorageClient(config)

namespace = client.get_namespace().data
print(namespace)

image.png

Upload the exported file to Oracle Object Storage

The authentication setup is done, and the SDK installation is also done. We now have everything we need to interact with the Oracle Object Storage service.
Using the following sample script, we should be able to upload the file T4_1_export.avro to Object Storage.

import oci

# use default config file: $HOME/.oci/config
config = oci.config.from_file()

client = oci.object_storage.ObjectStorageClient(config)

namespace = client.get_namespace().data
bucket_name = "bq-tables"

# Then upload the file to Object Storage
file_name = 'T4_1_export.avro'
upload_file_name = file_name
with open("./"+file_name, 'rb') as f:
    obj = client.put_object(namespace, bucket_name, upload_file_name, f)
print("Uploading {}".format(file_name))

After executing the script, we should see output like below.

image.png

In the Oracle Object Storage bucket, we can see that the exported AVRO file T4_1_export.avro is there.

image.png

Now, we can move to the next and final step - loading data into ADW.

Using rclone to transfer the data

Besides using the GCS and OCI Object Storage SDKs to transfer the data, we could also use the open-source command-line utility rclone to synchronize the data from GCS to OCI Object Storage.

Since this post mainly covers how to use the SDKs to complete the migration job, we will not dig into the details of the rclone approach beyond the brief sketch below. You can refer to the Oracle white paper Transferring Data to Object Storage from Other Cloud Providers or Local File Systems for now. Although Google Cloud Storage is not covered in that white paper, the method is similar. In addition, there will be another post covering the detailed steps of synchronizing data from GCS to OCI Object Storage with rclone. Once that post is completed, I will update the link here.
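
For completeness, a rough sketch of what the rclone approach could look like, assuming two remotes have been defined with rclone config: gcs pointing at Google Cloud Storage and oos pointing at OCI Object Storage via its S3-compatible API (the remote names are placeholders):

# synchronize the exported files from the GCS bucket to the OCI Object Storage bucket
rclone sync gcs:bucket-bq2adw oos:bq-tables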

Load data into Oracle Autonomous Data Warehouse

Loading data from Oracle Object Storage into ADW is actually an interaction between those two services; the intermediate server only plays the role of commander. Once we send the commands to the ADW instance, ADW connects to Object Storage and pulls (loads) the target data into the database instance.
As with any other interaction with cloud services, we need to set up authentication first.

Create credential to access Object Storage

Assume we have already provisioned an ADW instance called ADW02. To load data from Object Storage into the ADW instance, we first need to make sure the ADW instance can access Object Storage, so we need to create a credential in ADW.

The credential is created with the OCI user name and the generated Auth Token, so please generate a token before creating the credential.
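
If you prefer the command line, an auth token can also be generated with the OCI CLI (the user OCID and description below are placeholders):

# generate an auth token for the OCI user; note the token value is displayed only once
oci iam auth-token create --user-id ocid1.user.oc1..<unique_id> --description "adw02-credential"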

image.png

Then, log in to the ADW instance as the target user and use the following procedure to create the credential.

SET DEFINE OFF
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'ADW02_CRED',
    username => '[OCI USER NAME]',
    password => '[USER AUTH TOKEN]'
  );
END;
/

For detailed information on this topic, please refer to the official ADW document, Load Data from Files in the Cloud.

Loading data

If you have gone through the document Load Data from Files in the Cloud, you already know that to load an AVRO file into the ADW instance, we need to use the provided procedure DBMS_CLOUD.COPY_DATA. Since we want this work to be controlled programmatically and integrated with the other steps mentioned above, we will use the cx_Oracle library to execute the SQL statements.

Install the cx_Oracle library

python -m pip install --upgrade cx_Oracle
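
Note that connecting to ADW with cx_Oracle also requires the Oracle client and the ADW credentials wallet on the intermediate server. A rough sketch, assuming Oracle Instant Client is already installed and the wallet zip has been downloaded from the OCI console (file and directory names are placeholders):

# unzip the downloaded credentials wallet
unzip Wallet_ADW02.zip -d /opt/adw_wallet
# edit /opt/adw_wallet/sqlnet.ora so that WALLET_LOCATION points to /opt/adw_wallet,
# then let the client pick up tnsnames.ora and sqlnet.ora from that directory
export TNS_ADMIN=/opt/adw_wallet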

In addition, please note that the target table (T4_AVRO in this sample) must be created before the loading, so run the following SQL in advance.

create table T4_AVRO(
    keyid   number(19),
    col_01  binary_double,
    col_02  binary_double
);

Let's review the prerequisites of loading data into the ADW instance.

  • create a credential to access Object Storage - done
  • create the table before loading - done

Now, we can run the following sample script to load the AVRO data stored in Oracle Object Storage into the target ADW instance.

import cx_Oracle

def load_data_from_oos(connection):

    sql_copy_data = """BEGIN
    DBMS_CLOUD.COPY_DATA(
        table_name =>'T4_AVRO',
        credential_name =>'[CREDENTIAL NAME]',
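        -- the Object Storage file URI typically looks like:
        -- https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/<object name>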
        file_uri_list =>'[OBJECT STORAGE URI OF FILE]',
        format => '{
            "type": "avro",
            "schema": "all"
        }'
    );
    END;"""

    sql_count = """select count(1) from T4_AVRO"""

    sql_truncate = """truncate table T4_AVRO"""

    with connection.cursor() as cur:

        cur.execute(sql_truncate)

        cur.execute(sql_count)
        cnt = cur.fetchone()[0]
        print("Before: {}".format(cnt))

        cur.execute(sql_copy_data)

        cur.execute(sql_count)
        cnt = cur.fetchone()[0]
        print("After:  {}".format(cnt))


if __name__ == "__main__":
    con = cx_Oracle.connect(user='[USER]', password='[PASSWORD]', dsn='[ADW SERVICE NAME]')

    load_data_from_oos(con)

We should see output like below, which means all 200,000 records we exported from the BigQuery table have been loaded into the ADW instance.

image.png

This can be confirmed from SQL*Plus as well, although we already checked the count in the script.

image.png

At this point, we have completed all the individual steps needed to migrate a table from BigQuery to ADW.

  • export table from BigQuery to Google Cloud Storage
  • use an intermediate server to transfer the exported file from Google Cloud Storage to Oracle Object Storage
  • load data from Oracle Object Storage into ADW instance

Make the table migration an integrated job

As shown above, we can complete every single step with Python code, so next we will integrate these steps and make the table migration an end-to-end job.
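
Here is a minimal sketch of how the integration might look, assuming the per-step snippets above have been wrapped into small modules named bq, gcs, oos and adw (which is also where the logger names in the log below come from); the exact function names and signatures are assumptions, not the actual code that produced the log.

import logging
import time

import cx_Oracle

# Assumed wrappers around the per-step snippets shown earlier.
from bq import export_table              # BigQuery -> GCS
from gcs import download_blob            # GCS -> intermediate server
from oos import upload_file              # intermediate server -> Object Storage
from adw import load_data_from_oos       # Object Storage -> ADW

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def migrate_table(table_id, connection):
    """Run the four migration steps for a single table end to end."""
    logger.info("=================================================")
    logger.info("Start to migrate table: %s", table_id)
    start = time.time()

    file_name = "{}_export.avro".format(table_id)
    export_table(table_id, file_name)
    download_blob('bucket-bq2adw', file_name, file_name)
    upload_file(file_name)
    load_data_from_oos(connection, table_id)

    logger.info("Spent %.2f seconds to complete the migration of table: %s",
                time.time() - start, table_id)


if __name__ == "__main__":
    con = cx_Oracle.connect(user='[USER]', password='[PASSWORD]', dsn='[ADW SERVICE NAME]')
    for table_id in ("T4_1", "T4_2", "T4_3"):
        migrate_table(table_id, con)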

Below is the log of migrating 3 tables (T4_1, T4_2, T4_3) using the integrated code.


2019-11-06 07:19:18,372 - __main__ - INFO - =================================================
2019-11-06 07:19:18,373 - __main__ - INFO - Start to migrate table: T4_1
2019-11-06 07:19:25,315 - bq - INFO - Exported bq2adw:bqds02.T4_1 to gs://bucket-bq2adw/T4_1.csv.gz
2019-11-06 07:19:26,572 - gcs - INFO - Blob T4_1.csv.gz downloaded to /var/tmp/bq2adw/src/tmp_files/T4_1.csv.gz.
2019-11-06 07:19:26,766 - oos - INFO - Uploaded /var/tmp/bq2adw/src/tmp_files/T4_1.csv.gz
2019-11-06 07:19:26,937 - adw - INFO - T4_1: before loading count: 0
2019-11-06 07:19:29,824 - adw - INFO - T4_1:  after loading count: 200000
2019-11-06 07:19:29,825 - __main__ - INFO - Spent 11.45 seconds to complete the migration of table: T4_1
2019-11-06 07:19:29,825 - __main__ - INFO - =================================================
2019-11-06 07:19:29,825 - __main__ - INFO - Start to migrate table: T4_2
2019-11-06 07:19:35,477 - bq - INFO - Exported bq2adw:bqds02.T4_2 to gs://bucket-bq2adw/T4_2.csv.gz
2019-11-06 07:19:36,444 - gcs - INFO - Blob T4_2.csv.gz downloaded to /var/tmp/bq2adw/src/tmp_files/T4_2.csv.gz.
2019-11-06 07:19:36,579 - oos - INFO - Uploaded /var/tmp/bq2adw/src/tmp_files/T4_2.csv.gz
2019-11-06 07:19:36,702 - adw - INFO - T4_2: before loading count: 0
2019-11-06 07:19:39,528 - adw - INFO - T4_2:  after loading count: 200000
2019-11-06 07:19:39,528 - __main__ - INFO - Spent 9.7 seconds to complete the migration of table: T4_2
2019-11-06 07:19:39,529 - __main__ - INFO - =================================================
2019-11-06 07:19:39,529 - __main__ - INFO - Start to migrate table: T4_3
2019-11-06 07:19:44,353 - bq - INFO - Exported bq2adw:bqds02.T4_3 to gs://bucket-bq2adw/T4_3.csv.gz
2019-11-06 07:19:45,413 - gcs - INFO - Blob T4_3.csv.gz downloaded to /var/tmp/bq2adw/src/tmp_files/T4_3.csv.gz.
2019-11-06 07:19:45,561 - oos - INFO - Uploaded /var/tmp/bq2adw/src/tmp_files/T4_3.csv.gz
2019-11-06 07:19:45,683 - adw - INFO - T4_3: before loading count: 0
2019-11-06 07:19:48,382 - adw - INFO - T4_3:  after loading count: 200000
2019-11-06 07:19:48,382 - __main__ - INFO - Spent 8.85 seconds to complete the migration of table: T4_3

References

  • BigQuery Client Libraries
  • Exporting table data
  • Python Client for Google BigQuery
  • Cloud Storage Client Libraries
  • Downloading objects
  • Python Client for Google Cloud Storage
  • Load Data from Files in the Cloud
  • Transferring Data to Object Storage from Other Cloud Providers or Local File Systems
