Exploring the Lineage of Data with Cloud Data Fusion

Lab · 1 hour 30 minutes · 7 Credits · Advanced

GSP812

Google Cloud Self-Paced Labs logo

Overview

This lab shows you how to use Cloud Data Fusion to explore data lineage: the origins of your data and how it moves over time.

Cloud Data Fusion data lineage helps you:

  • Detect the root cause of bad data events.
  • Perform an impact analysis prior to making data changes.

Cloud Data Fusion provides lineage at both the dataset level and the field level. Lineage is also time-bound, so you can view it for a selected time interval.

  • Dataset level lineage shows the relationship between datasets and pipelines in a selected time interval.
  • Field level lineage shows the operations that were performed on a set of fields in the source dataset to produce a different set of fields in the target dataset.

For the purposes of this lab, you will use two pipelines that demonstrate a typical scenario in which raw data is cleaned and then sent for downstream processing. This data trail, from raw data to cleaned shipment data to analytic output, can be explored using the Cloud Data Fusion lineage feature.

Note: Currently, the Cloud Data Fusion Lineage feature is only available with the Cloud Data Fusion Enterprise Edition.

Objectives

In this lab, you will explore how to:

  • Run sample pipelines to produce lineage.
  • Explore dataset and field level lineage.
  • Pass handshaking information from the upstream pipeline to the downstream pipeline.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Google Cloud Skills Boost using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

    Note: Once you click Start lab, it will take about 15 - 20 minutes for the lab to provision necessary resources and create a Data Fusion instance. During that time, you can read through the steps below to get familiar with the goals of the lab.

    When you see lab credentials (Username and Password) in the left panel, the instance is created and you can continue logging into the console.
  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.

  5. Click Open Google console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Note: Do not click End lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Log in to Google Cloud Console

  1. In the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.
Note: If you are asked to choose an account, click Use another account.
  2. Paste in the Username, and then the Password as prompted.
  3. Click Next.
  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options
  • Do not sign up for free trials
  5. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue.
    It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:
gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net

  • List the project ID:
gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
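If you prefer to work from Cloud Shell, the same grant can be made with gcloud. The following is a minimal sketch, not part of the graded lab steps; it assumes gcloud is authenticated as your lab user and derives the project number from the active project:

# Look up the active project and its project number.
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")

# Grant the Editor role to the default Compute Engine service account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/editor"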

Prerequisites

In this lab, you will work with two pipelines:

  • The Shipment Data Cleansing pipeline, which reads raw shipment data from a small sample dataset and applies transformations to clean the data.
  • The Delayed Shipments USA pipeline, which reads the cleansed shipment data, analyzes it, and finds shipments within the USA that were delayed by more than a threshold.

Use the Shipment Data Cleansing and Delayed Shipments USA links to download these sample pipelines to your local machine.

Task 1. Add the necessary permissions for your Cloud Data Fusion instance

  1. In the Google Cloud console, from the Navigation menu select Data Fusion > Instances.
Note: Creation of the instance takes around 10 minutes. Please wait for it to be ready.

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. From the Google Cloud console, navigate to IAM & Admin > IAM.

  2. Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com is present, and copy the service account to your clipboard.

  3. On the IAM Permissions page, click +Grant Access.

  4. In the New principals field paste the service account.

  5. Click into the Select a role field and start typing "Cloud Data Fusion API Service Agent", then select it.

  6. Click Save.
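This grant also has a gcloud equivalent. A minimal sketch, reusing the PROJECT_ID and PROJECT_NUMBER variables from the earlier snippet and assuming roles/datafusion.serviceAgent is the role ID behind the Cloud Data Fusion API Service Agent name:

# Grant the Cloud Data Fusion API Service Agent role to the default Compute Engine service account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/datafusion.serviceAgent"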

Click Check my progress to verify the objective: Add Cloud Data Fusion API Service Agent role to service account.

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  4. Next, navigate to IAM & Admin > Service Accounts.

  5. Click the default Compute Engine service account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Permissions tab at the top of the page.

  6. Click the Grant Access button.

  7. In the New principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.
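As with the earlier grants, this step can also be scripted from Cloud Shell. A minimal sketch, reusing the PROJECT_NUMBER variable from the earlier snippets:

# Allow the Google-managed Data Fusion service account to act as the default Compute Engine service account.
gcloud iam service-accounts add-iam-policy-binding \
  "${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"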

Task 2. Open the Cloud Data Fusion UI

  1. In the Console, return to Navigation menu > Data Fusion, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in. If prompted to take a tour of the service, click on No, Thanks. You should now be in the Cloud Data Fusion UI.

  2. Click Studio from the left navigation panel to open the Cloud Data Fusion Studio page.

Cloud Fusion Studio UI

Task 3. Import, deploy, and run the Shipment Data Cleansing pipeline

  1. Next, import the Shipment Data Cleansing pipeline. Click Import in the top right of the Studio page, then select the Shipment Data Cleansing pipeline that you downloaded earlier.
Note: If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.

Shipment Data Cleansing pipeline

  2. Now deploy the pipeline. Click Deploy in the top right of the Studio page. After deployment, the Pipeline page opens.

  3. Click Run in the top center of the Pipeline page to run the pipeline.

Click Check my progress to verify the objective: Import, Deploy and Run Shipment Data Cleansing pipeline.
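Deployed pipelines can also be started and monitored through the instance's CDAP REST API, which is useful when you want to script runs instead of clicking Run. The sketch below is illustrative only: the instance name cdf-instance and region us-central1 are assumptions (check the Instances page for yours), the application name Shipment-Data-Cleansing is assumed to match the imported pipeline, and batch pipelines are assumed to run as the DataPipelineWorkflow workflow.

# Look up the instance's API endpoint and get an access token.
CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe cdf-instance \
  --location=us-central1 --format="value(apiEndpoint)")
AUTH_TOKEN=$(gcloud auth print-access-token)

# Start the deployed pipeline.
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/Shipment-Data-Cleansing/workflows/DataPipelineWorkflow/start"

# List its runs and watch for a status of COMPLETED.
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/Shipment-Data-Cleansing/workflows/DataPipelineWorkflow/runs"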

Task 4. Import, deploy, and run the Delayed Shipments data pipeline

After the status of the Shipment Data Cleansing pipeline shows Succeeded, proceed to import and deploy the Delayed Shipments USA data pipeline that you downloaded earlier.

  1. Click Studio from the left navigation panel to return to the Cloud Data Fusion Studio page.

  2. Click Import in the top right of the Studio page, then select and import the Delayed Shipments USA data pipeline that you downloaded earlier.

Note: If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.
  3. Deploy the pipeline by clicking Deploy in the top right of the Studio page. After deployment, the Pipeline page opens.

  4. Click Run in the top center of the Pipeline page to run the pipeline.

After this second pipeline successfully completes, you can continue to perform the remaining steps below.

Click Check my progress to verify the objective: Import, Deploy, and Run the Delayed Shipments data pipeline.

Task 5. Discover some datasets

You must discover a dataset before exploring its lineage.

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.
  2. Since the Shipment Data Cleansing pipeline specified "Cleaned-Shipments" as the reference name of its output dataset, enter shipment in the Search box. The search results include this dataset.

Cleaned shipments metadata search results
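The same search is also exposed by the instance's metadata REST API. A minimal sketch, reusing the CDAP_ENDPOINT and AUTH_TOKEN variables from the earlier snippet and assuming the default namespace:

# Search metadata for anything matching "shipment"; the Cleaned-Shipments dataset should appear in the results.
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/metadata/search?query=shipment"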

Task 6. Use tags to discover datasets

A Metadata search discovers datasets that have been consumed, processed, or generated by Cloud Data Fusion pipelines. Pipelines execute on a structured framework that generates and collects technical and operational metadata. The technical metadata includes dataset name, type, schema, fields, creation time, and processing information. This technical information is used by the Cloud Data Fusion metadata search and lineage features.

Although the Reference Name of sources and sinks is a unique dataset identifier and an excellent search term, you can use other technical metadata as search criteria, such as a dataset description, schema, field name, or metadata prefix.

Cloud Data Fusion also supports the annotation of datasets with business metadata, such as tags and key-value properties, which can be used as search criteria. For example, to add and search for a business tag annotation on the Raw Shipping Data dataset:

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.

  2. Enter Raw shipping data in the Search box on the metadata Search page.

  3. Click on Raw_Shipping_Data.

  4. Under Business tags, click + then insert a tag name (alphanumeric and underscore characters are allowed) and press Enter.

Business tags name field

You can perform a search on a tag by clicking the tag name or by entering tags:tag_name in the search box on the Metadata search page.

Task 7. Explore data lineage

Dataset level lineage

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page, and enter shipment in the Search box.

  2. Click on the Cleaned-Shipments dataset name listed on the Search page.

  3. Then click the Lineage tab. The lineage graph shows that this dataset was generated by the Shipments-Data-Cleansing pipeline, which had consumed the Raw_Shipping_Data dataset.

Cloud Data Fusion Lineage tab
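Dataset level lineage can also be retrieved programmatically, which is handy for audits. A sketch under the same assumptions as the earlier snippets; the dataset name is taken from the pipeline's Cleaned-Shipments reference name, and the start and end parameters bound the time interval (here, the last 24 hours) in epoch seconds:

# Fetch dataset level lineage for Cleaned-Shipments over the last 24 hours.
END=$(date +%s)
START=$((END - 86400))
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/datasets/Cleaned-Shipments/lineage?start=${START}&end=${END}"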

Field level lineage

Cloud Data Fusion field level lineage shows the relationship between the fields of a dataset and the transformations that were performed on a set of fields to produce a different set of fields. Like dataset level lineage, field level lineage is time-bound, and its results change with time.

  1. Continuing from the Dataset level lineage step, click the Field Level Lineage button in the top right of the Cleaned Shipments dataset-level lineage graph to display its field level lineage graph.

Cloud Data Fusion field level lineage

  2. The field level lineage graph shows connections between fields. You can select a field to view its lineage. Select View, then select Pin field to view that field's lineage only.

Data Fusion pin field lineage selection

  3. Locate the time_to_ship field under the Cleaned-Shipments dataset, select View, then select View impact to perform an impact analysis.

view impact

The field level lineage shows how this field has been transformed over time. Notice the transformations for the time_to_ship field: (i) converting it to a float type column, and (ii) determining whether the value is redirected to the next node or down the error path.

The lineage exposes the history of changes a particular field has gone through. Other examples include concatenating a few fields to compose a new field (like combining first name and last name to produce name) or computations performed on a field (like converting a number to a percentage of a total count).

The cause and impact links show the transformations performed on both sides of a field in a human-readable ledger format.

Congratulations!

In this lab, you have learned how to explore the lineage of your data. This information can be essential for reporting and governance, and it can help different audiences understand how data came to be in its current state.

Continue your quest

This self-paced lab is part of the Building Advanced Codeless Pipelines on Cloud Data Fusion quest. A quest is a series of related labs that form a learning path. Completing this quest earns you a badge to recognize your achievement. You can make your badge or badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. Refer to the Google Cloud Skills Boost catalog for all available quests.

Explore other quests

Next, it is suggested that you continue with Building Codeless Pipelines on Cloud Data Fusion.

Manual Last Updated November 14, 2022

Lab Last Tested August 08, 2023

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.