Processing Time Windowed Data with Apache Beam and Cloud Dataflow (Java)
In this lab you will learn how to deploy a Java application using Maven to process data with Cloud Dataflow. The Java application implements time-windowed aggregation to augment the raw data, producing consistent training and test datasets that can be used to refine your machine learning models in later labs.
The base data set provides historical information about internal flights in the United States, retrieved from the US Bureau of Transportation Statistics website. This data set can be used to demonstrate a wide range of data science concepts and techniques, and it is used in all of the other labs in the Data Science on the Google Cloud Platform and Data Science on Google Cloud Platform: Machine Learning quests.
Time-windowed aggregate data sets are useful because they allow you to improve the accuracy of data models where behavior changes in a regular or semi-regular pattern over a period of time. For example, with flights you know that, in general, taxi-out times increase during peak hour periods. You also know from earlier labs that the arrival delay time you are interested in modelling throughout these labs varies in response to taxi-out time. Airlines are aware of this too, and they adjust their scheduled arrival times to account for the average taxi-out time expected at the scheduled departure time at each airport. By computing time-windowed aggregate data you can more accurately model whether a given flight will be delayed by identifying parameters, such as taxi-out time, that genuinely exceed the average for that time window.
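The idea of comparing a flight against the average for its time window can be sketched in plain Java. This is only an illustration and not the lab's Apache Beam pipeline: the `Departure` class, its field names, the `airport@hour` key format, and the fixed one-hour window size are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; the lab itself uses the Apache Beam SDK.
class TaxiOutWindows {

    // Hypothetical record of one departure; the field names are assumptions.
    static final class Departure {
        final String airport;
        final int minuteOfDay;        // departure time as minutes after midnight
        final double taxiOutMinutes;

        Departure(String airport, int minuteOfDay, double taxiOutMinutes) {
            this.airport = airport;
            this.minuteOfDay = minuteOfDay;
            this.taxiOutMinutes = taxiOutMinutes;
        }
    }

    // Assumed window size: fixed, non-overlapping one-hour windows.
    static final int WINDOW_MINUTES = 60;

    // Builds a key such as "JFK@17" for the 17:00-17:59 window at JFK.
    static String windowKey(String airport, int minuteOfDay) {
        return airport + "@" + (minuteOfDay / WINDOW_MINUTES);
    }

    // Computes the average taxi-out time for each (airport, window) pair.
    static Map<String, Double> averageByWindow(List<Departure> departures) {
        Map<String, double[]> totals = new TreeMap<>(); // value = {sum, count}
        for (Departure d : departures) {
            double[] acc = totals.computeIfAbsent(
                windowKey(d.airport, d.minuteOfDay), k -> new double[2]);
            acc[0] += d.taxiOutMinutes;
            acc[1] += 1;
        }
        Map<String, Double> averages = new TreeMap<>();
        totals.forEach((key, acc) -> averages.put(key, acc[0] / acc[1]));
        return averages;
    }

    // Returns how far a flight's taxi-out time is above (positive) or below
    // (negative) the average for its window. Assumes the window has history.
    static double excessOverWindow(Departure d, Map<String, Double> averages) {
        return d.taxiOutMinutes - averages.get(windowKey(d.airport, d.minuteOfDay));
    }

    public static void main(String[] args) {
        List<Departure> history = List.of(
            new Departure("JFK", 17 * 60 + 5, 30.0),
            new Departure("JFK", 17 * 60 + 40, 20.0),
            new Departure("JFK", 3 * 60 + 10, 10.0));
        Map<String, Double> averages = averageByWindow(history);

        Departure flight = new Departure("JFK", 17 * 60 + 20, 28.0);
        System.out.println(averages);
        System.out.println(excessOverWindow(flight, averages));
    }
}
```

In this made-up data the 17:00 window at JFK averages 25 minutes, so a 28-minute taxi-out is only 3 minutes above what is normal for that hour, even though it is far above the early-morning average of 10 minutes. That per-window difference is the kind of feature the aggregate data sets make available to a model.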
Using Apache Beam to create these aggregate data sets is useful because it can run in batch mode to create the training and test data sets from historical data, and the same code can then run in streaming mode to compute averages on real-time streaming data. Reusing the same code helps mitigate any training-serving skew that could arise if a different language or platform were used to process the historical data and the streaming data.
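The benefit of sharing one piece of aggregation logic between the two modes can be shown with a small plain-Java sketch. It does not use the Beam SDK; the `MeanAccumulator` class and the two driver methods are hypothetical names chosen for the example, and the "stream" is simulated with an iterator.

```java
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration only; names here are hypothetical, not Beam APIs.
class SharedAggregation {

    // The single piece of aggregation logic used by both modes.
    static final class MeanAccumulator {
        private double sum;
        private long count;

        void add(double value) {
            sum += value;
            count++;
        }

        // Returns NaN when nothing has been added yet.
        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    // Batch mode: the whole bounded data set is available up front.
    static double batchMean(List<Double> values) {
        MeanAccumulator acc = new MeanAccumulator();
        for (double v : values) {
            acc.add(v);
        }
        return acc.mean();
    }

    // Streaming mode (simulated): values arrive one at a time.
    static double streamingMean(Iterator<Double> arrivals) {
        MeanAccumulator acc = new MeanAccumulator();
        while (arrivals.hasNext()) {
            acc.add(arrivals.next());
        }
        return acc.mean();
    }

    public static void main(String[] args) {
        List<Double> taxiOutMinutes = List.of(12.0, 18.0, 30.0);
        System.out.println(batchMean(taxiOutMinutes));
        System.out.println(streamingMean(taxiOutMinutes.iterator()));
    }
}
```

Because both paths go through the same accumulator, the value computed for training and the value computed at serving time cannot drift apart through a reimplementation error, which is the skew the paragraph above refers to.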
Cloud Dataflow is a fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes via Java and Python APIs with the Apache Beam SDK. Cloud Dataflow provides a serverless architecture that can be used to shard and process very large batch data sets, or high-volume live streams of data, in parallel.
Google BigQuery is a RESTful web service that enables interactive analysis of very large datasets and works in conjunction with Google Cloud Storage.
Objectives
- Configure the Apache Maven project object model (POM) file
- Deploy the Java application to Cloud Dataflow, using Apache Beam, to create the aggregate training and test data files
Check that the Cloud Dataflow project called CreateTraining DataSet is running.
Check that the training and test data sets have been written to Cloud Storage.
Check that the flights.testFlights table exists in BigQuery.
Check that the flights.trainFlights table exists in BigQuery.