Create and Manage Dataset with Azure Machine Learning

Overview

In recent years, MLOps has grown in popularity as more and more teams move ML code to production.
Among the cloud providers, Microsoft leads the way with Azure Machine Learning Service, which makes adopting MLOps easy.
Using Azure ML Service, we can do several things, including the following tasks:

  1. create and manage datasets
  2. train machine learning models
  3. deploy and serve models as web services

In this post, we will focus on how to create and manage datasets with Azure ML Service.
Since I mainly write code in Python, I will concentrate on the Python SDK.

Create Workspace

To use Azure ML Service, we first need to create a Workspace.
The Workspace is the top-level resource of Azure ML; it provides a centralized place to work with all the artifacts used in a machine learning project, such as datasets, trained models, and experiments.
To create a workspace, we can use the Azure Portal. For more details, you can follow the link below.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace

After the Workspace is created, we manage it with the Python SDK.
You will need to authenticate to get the Workspace object.

If you are using a Jupyter notebook, InteractiveLoginAuthentication is used by default, as in the following code. Running the sample below will prompt us to sign in to Azure.

from azureml.core import Workspace

ws = Workspace.get(
    name="<my-workspace-name>",
    subscription_id="<azure-subscription-id>",
    resource_group="<my-resource-group>"
)
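
If your Azure account belongs to multiple Azure AD tenants, the default interactive login may pick the wrong one. As a minimal sketch (the tenant ID is a placeholder), you can pin the login to a specific tenant by constructing InteractiveLoginAuthentication yourself:

from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication

# Pin the interactive login to one Azure AD tenant (placeholder value)
interactive_auth = InteractiveLoginAuthentication(tenant_id="<TENANT_ID>")

ws = Workspace.get(
    name="<my-workspace-name>",
    subscription_id="<azure-subscription-id>",
    resource_group="<my-resource-group>",
    auth=interactive_auth
)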

To automate things, we can't rely on an interactive prompt, so we can use ServicePrincipalAuthentication instead. First, we have to create an Azure AD application (a service principal) and grant it access to our resource group.
For more details, you can read the article below.
https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal

Once the service principal is created, we will have these items:

  1. TENANT_ID
  2. SERVICE_PRINCIPAL_ID (CLIENT_ID)
  3. SERVICE_PRINCIPAL_SECRET (CLIENT_SECRET)

Then, we can access the Workspace without having to log in interactively.

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

svc_pr = ServicePrincipalAuthentication(
    tenant_id="<TENANT_ID>",
    service_principal_id="<CLIENT_ID>",
    service_principal_password="<CLIENT_SECRET>"
)

ws = Workspace.get(
    name="<my-workspace-name>",
    subscription_id="<azure-subscription-id>",
    resource_group="<my-resource-group>",
    auth=svc_pr
)

By this point, our application or automated task scripts can access the Workspace without any manual login.
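
In real automation, we should also avoid hard-coding the secret in the script. One common pattern is to read the credentials from environment variables; in this sketch, the AZUREML_* variable names are my own placeholders, not anything required by the SDK:

import os

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Credentials are injected by the CI/CD system or the shell environment.
# The variable names below are arbitrary placeholders.
svc_pr = ServicePrincipalAuthentication(
    tenant_id=os.environ["AZUREML_TENANT_ID"],
    service_principal_id=os.environ["AZUREML_CLIENT_ID"],
    service_principal_password=os.environ["AZUREML_CLIENT_SECRET"]
)

ws = Workspace.get(
    name=os.environ["AZUREML_WORKSPACE_NAME"],
    subscription_id=os.environ["AZUREML_SUBSCRIPTION_ID"],
    resource_group=os.environ["AZUREML_RESOURCE_GROUP"],
    auth=svc_pr
)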

Create Azure Blob Storage

To work with data in Azure ML, we need to connect a Datastore to the Workspace.
Although this list is by no means exhaustive, an Azure ML Workspace supports the following Datastore types:

  1. Azure Blob Storage
  2. Azure File Storage
  3. Azure SQL Database
  4. Azure Data Lake Storage

For now, we will use Azure Blob Storage as our Datastore.
To create Azure Blob Storage, we can use the Azure Portal.
For more details, you can read the article below.
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal

Register Blob Storage Container to Workspace

After the container is created, we need to attach it as a Datastore for our Workspace.
To do so, we can use the following code to register our container as a Datastore named <my-datastore>.

from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="<my-datastore>",
    container_name="<container-name>",
    account_name="<AZURE_STORAGE_ACCOUNT>",
    account_key="<AZURE_STORAGE_ACCOUNT_KEY>"
)

As we can see, we need the storage account name and the storage account key. We can get both from the Azure Portal, under the storage account's Access keys.

After we register the Datastore to the Workspace, we can access it later with the following code.

from azureml.core import Datastore

datastore = Datastore.get(ws, datastore_name="<datastore-name>")
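
Note that every Workspace also comes with a default Blob-backed Datastore, so if a separate container is not needed we can skip the registration step entirely:

# Every Workspace has a default datastore created together with it
datastore = ws.get_default_datastore()
print(datastore.name)  # typically "workspaceblobstore"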

Upload CSV File to Blob Storage Container

For later use, let's upload a CSV file to the container from the Azure Portal.
Note that we can still use this storage as the backend of our web application and other tasks.
This means we can easily integrate our application data with Azure ML Service.
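
If you prefer to stay in Python, the registered Datastore can also upload local files for you. Here is a minimal sketch; the local path ./data/sample.csv is a hypothetical name:

# Upload a local CSV into the container behind the datastore.
# "./data/sample.csv" is a placeholder path; with no target_path
# the file lands in the container root.
datastore.upload_files(
    files=["./data/sample.csv"],
    overwrite=True,
    show_progress=True
)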

Read Dataset from Datastore

Next, we can read our CSV file easily with the Dataset class, as in the following code.

from azureml.core import Dataset

dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "<csv-filename>"),
    separator=","
)

At this point, we have the metadata for our CSV file, and the dataset is ready to be registered to our Workspace.
To make sure we are using the right CSV file, we can convert the dataset to a pandas DataFrame and print its contents. For example, we can use the following code to do so.

df = dataset.to_pandas_dataframe()
print(df.head())
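
For a large file, converting the whole dataset to a DataFrame just to inspect it is wasteful; a TabularDataset can also materialize only a few records:

# Preview just the first rows instead of loading the entire file
preview_df = dataset.take(5).to_pandas_dataframe()
print(preview_df)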

Register Dataset to Workspace

After reading the Dataset from the Datastore, we can then register it to our Workspace for later use.
When we register the Dataset to the Workspace, Azure ML manages its versions automatically. Furthermore, in the training phase, we can use the dataset securely without extra authentication.

We can use the following code to do this.

dataset = dataset.register(workspace=ws, name="<dataset_name>")
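
If we later register changed data under the same name, we can ask Azure ML to record it as a new version instead of raising an error. A short sketch using the create_new_version flag:

# Re-register under the same name; Azure ML stores it as a new version
dataset = dataset.register(
    workspace=ws,
    name="<dataset_name>",
    create_new_version=True
)
print(dataset.version)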

Use Dataset in Training

Once we have registered our dataset, we can access it from the training script.

# train.py
from azureml.core import Dataset, Run

run = Run.get_context()
ws = run.experiment.workspace

dataset = Dataset.get_by_name(workspace=ws, name="<dataset_name>")
df = dataset.to_pandas_dataframe()

# Here we can use df to train our model

Note that the script above is meant to run inside a training environment managed by Azure ML Service. If you just want to casually access the Dataset from your local computer, you need to authenticate with an interactive login: when you call dataset.to_pandas_dataframe(), you will be prompted to log in.
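
For reference, here is one way to submit train.py to Azure ML as an experiment run. This is only a minimal sketch: the compute target name, the environment, and the pip package list are assumptions you will need to adapt to your own setup.

from azureml.core import Environment, Experiment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

# Environment with the packages train.py needs; the package list is an assumption
env = Environment("train-env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "azureml-dataprep[pandas]", "pandas"]
)

# "<my-compute-target>" is a placeholder for an existing compute cluster
config = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="<my-compute-target>",
    environment=env
)

run = Experiment(workspace=ws, name="<experiment-name>").submit(config)
run.wait_for_completion(show_output=True)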

Conclusion

By reading this post, you should now understand the following concepts.

  1. What an Azure ML Workspace is and how to create and manage one using the Azure Portal
  2. How to access the Workspace from code using the Azure ML Python SDK
  3. How to register Azure Blob Storage as a Datastore in the Workspace using the Azure ML Python SDK
  4. How to read files from the Datastore and convert them to a Dataset using the Azure ML Python SDK
  5. How to register the Dataset to the Workspace using the Azure ML Python SDK
  6. Finally, how to use the Dataset in a training script using the Azure ML Python SDK