SQL Warehousing Tutorial

Note: this is not original material, merely an extract of the IBM tutorial documents that omits the step-by-step procedures, giving a slightly higher-level view of the tutorial before (and while) performing it.

Module 1: Designing the physical data model for your data warehouse

In this module, you will connect to the GSDB database and create a physical data model for the new data mart that you will build. You will create a MARTS schema in the data model and then update the GSDB database with the changes.

In this module, you prepare to create a data warehousing application, which is a data mart that will allow the managers to track sales for the fictional Great Outdoors company by store location and over time. As a starting point, you create a warehousing project. You reverse engineer a physical data model from the GSDB database so that you can build data flows that reference objects in the GSDB database. Then you extend the physical data model to add the MARTS schema and three new tables: MARTS.BRANCH_LOCATION, MARTS.TIME_DIMENSION, and MARTS.ORDER_FACT.

After you create the model, the schema, and the tables, you propagate the changes in the MARTS schema to the GSDB database so that you can create data flows to extract the data from the database and populate the new tables.

Learning objectives

After you complete the lessons in this module, you will understand the concepts and know how to do the following tasks:

  • Establish a connection to the database
  • Create a data warehousing project in the Design Studio
  • Create a physical data model
  • Create a schema and three new tables
  • Update the database

Time required

This module should take approximately 30 minutes to complete.

Prerequisites

You must first complete the prerequisites described in Setting up the GSDB sample database.

Note: The steps in this tutorial are designed to work with a new copy of the sample database and data. If you or someone else previously worked on this tutorial or another tutorial that uses the GSDB sample database, you need to run the script again to reset the sample database.

Creating a data warehousing project and physical data model for the GSDB database

In this lesson, you connect to the GSDB database, which is the database that contains sales data for the fictional Great Outdoors company. You then create a data warehousing project and a physical data model for the data mart that you will build.

Before you begin

You must first complete the prerequisites described in Setting up the GSDB sample database.

Note: The steps in this tutorial are designed to work with a new copy of the sample database and data. If you or someone else previously worked on this tutorial or another tutorial that uses the GSDB sample database, you need to run the script again to reset the sample database.

Term: data warehouse

The central point of data integration for business intelligence; it is the source of data for data marts within an enterprise and delivers a common view of enterprise data.

Overview

The Design Studio is the component that you use to create data warehousing projects from your DB2® relational database. You can use the Design Studio to perform data movement, data mining, and OLAP tasks.

Term: data warehousing project

Consists of a physical data model, data flows, and control flows.


In this lesson, you create a data warehousing project and a physical data model. This data model represents the GSDB database.

Term: physical data model

In DB2 data warehousing, a metadata model that contains the schema for your project, which includes the tables and data in the database. In later lessons, you will create a new schema and three new tables to create a data mart.

Tasks in this lesson

To create a data warehousing project and physical data model, do these tasks:

 1. Starting the Design Studio
 2. Connecting to the GSDB database
 3. Creating the SQW_Tutorial data warehousing project
 4. Reverse-engineering a physical data model

Creating a new schema and three new tables

In this lesson, you create a new schema called the MARTS schema, which you use to keep the warehousing objects in the database separate. You also create the MARTS.BRANCH_LOCATION, MARTS.TIME_DIMENSION, and MARTS.ORDER_FACT tables, which allow you to track sales by store location and over time.

Overview

You want to create a separate schema to contain objects (tables, views, and relationships) for your data mart. The original data that you need for the data mart is stored in the GOSALES.TIME_DIMENSION, GOSALES.BRANCH, GOSALES.ORDER_HEADER, and GOSALES.ORDER_DETAILS tables. Because you need only some of that data, you will create a new schema to manage only the information that you select.

In the schema, you will create three tables: MARTS.BRANCH_LOCATION, MARTS.TIME_DIMENSION, and MARTS.ORDER_FACT. The MARTS.ORDER_FACT table is the target table that your data warehousing application populates, which allows you to track sales by store location and over time.

Term: schema

A collection of database objects such as tables, views, indexes, functions, user-defined types, or triggers that defines a database. An SQL schema provides a logical classification of database objects.
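
To make the new objects concrete, the following is a minimal sketch of the kind of DDL this step produces. The column names and data types are illustrative assumptions, not the tutorial's exact definitions; in the Design Studio you define these tables through the model editor rather than by hand-writing SQL.

    -- Sketch only: column names and types are assumed, not taken from the tutorial.
    CREATE SCHEMA MARTS;

    -- Location dimension: a subset of the GOSALES.BRANCH columns
    CREATE TABLE MARTS.BRANCH_LOCATION (
        BRANCH_CODE   INTEGER NOT NULL PRIMARY KEY,
        CITY          VARCHAR(40),
        PROV_STATE    VARCHAR(40),
        COUNTRY_CODE  INTEGER
    );

    -- Time dimension: a subset of the GOSALES.TIME_DIMENSION columns
    CREATE TABLE MARTS.TIME_DIMENSION (
        DAY_KEY       INTEGER NOT NULL PRIMARY KEY,
        DAY_DATE      DATE,
        MONTH_KEY     INTEGER,
        QUARTER_KEY   INTEGER,
        CURRENT_YEAR  SMALLINT
    );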

Tasks in this lesson

To create a new schema and new tables, do these tasks:

 1. Creating the MARTS schema
 2. Creating the MARTS.BRANCH_LOCATION table
 3. Creating the MARTS.TIME_DIMENSION table
 4. Creating the MARTS.ORDER_FACT table and adding foreign-key constraints
 5. Validating your physical data model

Updating the GSDB database

In this lesson, you propagate the changes that you made in your project to the GSDB database.

Overview

You created a physical data model that contains the MARTS schema and the GOSALES schema. You need to propagate the changes that you made in the MARTS schema to the GSDB database. You use the Generate DDL wizard to generate and run a DDL script to update the GSDB database.
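
As an illustration, a generated script for this step might look roughly like the sketch below; the ORDER_FACT columns and constraint names are assumptions, and the actual script is produced for you by the Generate DDL wizard. A script in this form can also be run from the DB2 command line processor (for example with db2 -tvf), although in this tutorial the wizard runs it for you.

    -- Sketch of a generated DDL script (columns and constraint names are assumed).
    CONNECT TO GSDB;

    CREATE TABLE MARTS.ORDER_FACT (
        DAY_KEY          INTEGER NOT NULL,
        BRANCH_CODE      INTEGER NOT NULL,
        QUANTITY         INTEGER,
        UNIT_SALE_PRICE  DECIMAL(19,2),
        CONSTRAINT FK_TIME FOREIGN KEY (DAY_KEY)
            REFERENCES MARTS.TIME_DIMENSION (DAY_KEY),
        CONSTRAINT FK_BRANCH FOREIGN KEY (BRANCH_CODE)
            REFERENCES MARTS.BRANCH_LOCATION (BRANCH_CODE)
    );

    COMMIT WORK;
    CONNECT RESET;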

Module 2: Designing data flows

In this module, you will design three data flows to populate the tables in your data warehousing project. Data flows define the SQL-based data movement and transformation activities that run in a DB2® database.

You can design a data flow to create SQL-based data movement and transformation by placing operators in a canvas, defining their properties, and connecting their ports. Data flows extract data from data sources (for example, from flat files or relational tables), transform the data by using operators, and then export it into data targets (for example, a data warehouse, a file, or a staging table).

The Design Studio provides an intuitive way to visualize and design data flows. Database metadata, which is imported to or created in the Data Project Explorer, describes the schemas of database objects. Graphical operators model the various steps of a data flow activity. By arranging these source, transform, and target operators in a canvas work area, connecting them, and defining their properties, you can create models that meet your business requirements.

In this module, you first create a variable that you will use in your flows, and then you create three data flows: populate_branch_location, populate_time_dimension, and populate_order_fact. These flows are used to extract data from the tables in the GOSALES schema and populate the following tables: MARTS.BRANCH_LOCATION, MARTS.TIME_DIMENSION, and MARTS.ORDER_FACT. After you create each data flow, you generate its SQL code. SQL code defines the operations that are performed by the SQL execution database when you run a data flow. Finally, you will verify the logic in your flows by running the populate_order_fact data flow in debug mode.

Learning objectives

After completing the lessons in this module, you will understand the concepts and know how to do the following tasks:

  • Create a variable to use in data flows
  • Create data flows
  • Debug data flows

Time required

This module should take approximately 1.5 hours to complete.

Creating a variable to use in your data flows

In this lesson, you will create a variable that represents a specific date range. This variable will be used in the populate_time_dimension data flow and the populate_order_fact data flow to extract the orders for a specific time period.

Overview

You will create a variable called DAYS_TO_SUBTRACT that you will use in your data flows. You can use the variable to easily change the time period to extract the data that you want.

You will create a group for the variable, name the variable DAYS_TO_SUBTRACT, give it a default value of 1, and set the last change phase to runtime.
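
In the generated SQL, a variable of this kind typically appears inside a date predicate. Here is a hedged sketch of how DAYS_TO_SUBTRACT might be used; the column name DAY_DATE and the ${...} placeholder notation are informal assumptions:

    -- Sketch: keep only the rows for the last DAYS_TO_SUBTRACT days.
    -- At run time the placeholder is replaced by the variable's value (default 1).
    SELECT *
    FROM GOSALES.TIME_DIMENSION
    WHERE DAY_DATE >= CURRENT DATE - ${DAYS_TO_SUBTRACT} DAYS;

Because the last change phase is set to runtime, the value can be changed each time the flow runs, without editing or regenerating the flow.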

Creating a data flow that populates the MARTS.BRANCH_LOCATION table

In this lesson, you create a data flow that will extract data from the GOSALES.BRANCH table, remove some unnecessary columns, and insert the required data into the MARTS.BRANCH_LOCATION table. You connect data flow operators on the Design Studio canvas to model processing paths between operators.

Overview

You can design a data flow to create SQL-based data movement and transformation by placing operators in a canvas, defining their properties, and connecting their ports. A data flow is the basic structure for designing applications with the Design Studio. Several data flows can be combined within a control flow. A typical data warehousing application can have one or more control flows. You can deploy an application and then run the control flows that are in the deployed application.

The following figure shows the data flow that you will create in this lesson.

Figure 1. The populate_branch_location data flow extracts data from the GOSALES.BRANCH table and populates the MARTS.BRANCH_LOCATION table

The populate_branch_location data flow consists of three operators: a Table Source operator, a Select List operator, and an SQL Merge operator that writes to the target table.
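
In SQL terms, the flow copies a subset of the branch columns into the target. A rough sketch of the statement that the SQL Merge operator could generate, reusing the column names assumed earlier:

    -- Sketch of the generated SQL (column names are assumed).
    MERGE INTO MARTS.BRANCH_LOCATION AS tgt
    USING (SELECT BRANCH_CODE, CITY, PROV_STATE, COUNTRY_CODE
           FROM GOSALES.BRANCH) AS src
    ON tgt.BRANCH_CODE = src.BRANCH_CODE
    WHEN MATCHED THEN UPDATE SET
        CITY = src.CITY,
        PROV_STATE = src.PROV_STATE,
        COUNTRY_CODE = src.COUNTRY_CODE
    WHEN NOT MATCHED THEN INSERT
        (BRANCH_CODE, CITY, PROV_STATE, COUNTRY_CODE)
        VALUES (src.BRANCH_CODE, src.CITY, src.PROV_STATE, src.COUNTRY_CODE);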

Tasks in this lesson

To create the populate_branch_location data flow, do these tasks:

 1. Creating an empty data flow
 2. Adding a Table Source operator
 3. Adding a Select List operator
 4. Adding an SQL Merge operator
 5. Saving, validating, and running the data flow

Creating a data flow that populates the MARTS.TIME_DIMENSION table

In this lesson, you create a data flow that will extract data from the GOSALES.TIME_DIMENSION table, use operators to transform data, and populate the MARTS.TIME_DIMENSION table. You need the data from the GOSALES.TIME_DIMENSION table to drill down to various time increments and determine the profitability of each branch in the fictional Great Outdoors company.

Overview

The following figure shows the data flow that you will create in this lesson.

Figure 1. The populate_time_dimension data flow extracts data from the GOSALES.TIME_DIMENSION table and populates the MARTS.TIME_DIMENSION table

The populate_time_dimension data flow consists of four operators: a Table Source operator, a Select List operator, a Where Condition operator, and an SQL Merge operator that writes to the target table.
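
The Where Condition operator is where the DAYS_TO_SUBTRACT variable comes into play. A hedged sketch of the resulting statement, again with assumed column names:

    -- Sketch of the generated SQL (column names assumed; ${DAYS_TO_SUBTRACT} is the variable).
    MERGE INTO MARTS.TIME_DIMENSION AS tgt
    USING (SELECT DAY_KEY, DAY_DATE, MONTH_KEY, QUARTER_KEY, CURRENT_YEAR
           FROM GOSALES.TIME_DIMENSION
           WHERE DAY_DATE >= CURRENT DATE - ${DAYS_TO_SUBTRACT} DAYS) AS src
    ON tgt.DAY_KEY = src.DAY_KEY
    WHEN NOT MATCHED THEN INSERT
        (DAY_KEY, DAY_DATE, MONTH_KEY, QUARTER_KEY, CURRENT_YEAR)
        VALUES (src.DAY_KEY, src.DAY_DATE, src.MONTH_KEY, src.QUARTER_KEY, src.CURRENT_YEAR);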

Tasks in this lesson

To create the populate_time_dimension data flow, do these tasks:

 1. Creating an empty data flow
 2. Adding a Table Source operator
 3. Adding a Select List operator
 4. Adding a Where Condition operator
 5. Adding an SQL Merge operator
 6. Saving, validating, and running the data flow

Creating a data flow that populates the MARTS.ORDER_FACT table

In this lesson, you create the populate_order_fact data flow, which extracts data from the GOSALES.ORDER_HEADER, GOSALES.ORDER_DETAILS, and GOSALES.TIME_DIMENSION tables to populate the target table MARTS.ORDER_FACT. The MARTS.ORDER_FACT table will contain the sales data necessary to examine the profitability of the various locations over time.

Overview

The following figure shows the data flow that you will create in this lesson.

Figure 1. The populate_order_fact data flow populates the MARTS.ORDER_FACT table

The populate_order_fact data flow consists of the following operators: three Table Source operators for the GOSALES tables that feed the flow, a Table Join operator, a Where Condition operator, a Fact Key Replace operator, and a Table Target operator.
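
At its core, the flow is a three-way join filtered by the date variable. The sketch below shows roughly what the Table Join and Where Condition operators express; the join keys and column names are assumptions, and the Fact Key Replace step (which maps source keys to dimension keys) is omitted:

    -- Sketch of the join feeding MARTS.ORDER_FACT (join keys and columns are assumed).
    SELECT t.DAY_KEY,
           h.BRANCH_CODE,
           d.QUANTITY,
           d.UNIT_SALE_PRICE
    FROM GOSALES.ORDER_HEADER   AS h
    JOIN GOSALES.ORDER_DETAILS  AS d
        ON d.ORDER_NUMBER = h.ORDER_NUMBER
    JOIN GOSALES.TIME_DIMENSION AS t
        ON t.DAY_DATE = DATE(h.ORDER_DATE)
    WHERE DATE(h.ORDER_DATE) >= CURRENT DATE - ${DAYS_TO_SUBTRACT} DAYS;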

Tasks in this lesson

To create the populate_order_fact data flow, do these tasks:

 1. Creating an empty data flow
 2. Adding three Table Source operators
 3. Adding a Table Join operator
 4. Adding a Where Condition operator
 5. Adding a Fact Key Replace operator
 6. Adding a Table Target operator
 7. Saving and validating the data flow

Running a data flow in debug mode

In this lesson, you will learn how to debug your data flows to discover potential causes of errors. You will use the populate_order_fact data flow in this lesson.

Tasks in this lesson

To debug the populate_order_fact data flow, do these tasks:

 1. Adding two breakpoints to the data flow
 2. Running in debug mode

Learn more about breakpoints

The debugger runs the flow step by step. When you set a breakpoint, the debugger will run the flow to that point, and then you can view a sample of the rows of data to see the results of the debugging process.

When the debugger finishes processing, you can sample the rows of data at the two breakpoints to see whether the flow runs successfully up to and including the Table Join operator and the Where Condition operator.
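
Sampling rows at a breakpoint is conceptually the same as running the SQL generated up to that operator and fetching a handful of rows, along the lines of this informal sketch:

    -- Sketch: what sampling rows at a breakpoint amounts to.
    -- The subselect stands in for the SQL generated up to the breakpoint.
    SELECT *
    FROM (SELECT * FROM GOSALES.ORDER_DETAILS) AS intermediate_result
    FETCH FIRST 10 ROWS ONLY;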

Module 3: Creating control flows and preparing for deployment

In this module, you will learn how to create control flows and application packages in Design Studio.

You already set up data flows to conditionally process data into tables. Now you need to put these flows together into a complete control flow.

Although you can run a data flow during the development cycle in Design Studio (for debugging purposes), you cannot run a data flow in a test or production environment. Instead, you must run the data flows within control flows. In this module, you will create a control flow that invokes the data flows that you already created and you will prepare your flows for deployment on a test or production system.

Learning objectives

After you complete the lessons in this module, you will understand the concepts and know how to do the following tasks:

  • Create a control flow that runs a data flow that loads data into tables
  • Create a control flow that runs two data flows sequentially
  • Prepare a data warehouse application (control flow) for deployment by creating a deployment package (a .zip file)

Time required

This module should take approximately 30 minutes to complete.

Creating two control flows

In this lesson, you will create two control flows that will run data flows.

Overview

First, you will create a control flow that runs the populate_branch_location data flow, and then you will create the second control flow that runs the populate_time_dimension and populate_order_fact data flows. You know that you will need to run the populate_branch_location data flow only once on the server because the branch locations do not change often. You plan to run the other two data flows regularly to update transaction data. Therefore, you need to create two control flows so that you can run one once and schedule the other one to run at regular intervals.

Tasks in this lesson

To create two control flows, do these tasks:

 1. Creating a simple control flow to load the branch data
 2. Creating a control flow to populate the TIME_DIMENSION and ORDER_FACT tables

Creating a simple control flow to load the branch data

In this lesson, you create the load_branch_data control flow, which runs the populate_branch_location data flow.

Overview

Because the branch locations do not change often, you need to run this control flow only once on the server to load the MARTS.BRANCH_LOCATION table.

Creating a control flow to populate the TIME_DIMENSION and ORDER_FACT tables

In this lesson, you create the populate_time_and_order_fact control flow, which runs the populate_time_dimension and populate_order_fact data flows in sequence.

Overview

You need to create a control flow to populate the MARTS.TIME_DIMENSION and MARTS.ORDER_FACT tables. Because this control flow updates transaction data, you will later schedule it to run at regular intervals.

Preparing a data warehouse application for deployment

In this lesson, you create a deployment package that contains the control flows that you designed in the previous lesson. Deployment preparation is a task that you do in Design Studio and is a prerequisite to deploying an application in the Administration Console, which you will learn to do in the next module.

Overview

You need to create a data warehouse application that contains the necessary information to deploy the load_branch_data and populate_time_and_order_fact control flows on your test system. You will package these two control flows into a .zip file.

When you use the deployment preparation wizard, you select the control flows to include in the data warehouse application. You also need to define the resources and variables that the control flows will use. These attributes represent the application profile.

Module 4: Deploying an application and running the control flows in the Administration Console

In this module, you will learn to use the Administration Console to deploy and manage the data warehousing application that you created in the Design Studio.

You created and tested a data warehouse application in the Design Studio in the previous modules. Now you want to move that application to a production environment. You do this by deploying the data warehouse application file from the Administration Console.

After the application is deployed, you will run the load_branch_data control flow in the production environment to load the MARTS.BRANCH_LOCATION table.

For the populate_time_and_order_fact control flow, you will create a schedule so that the flow will run every day. You will also specify a value for the DAYS_TO_SUBTRACT variable so that when the flow runs, it will extract the sales data for the past day. Finally, you will use the Administration Console to monitor your control flows and their activities.
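
Concretely, with DAYS_TO_SUBTRACT set to 1, the date predicate sketched in Module 2 reduces to selecting the previous day's rows. An informal check (the column name is assumed):

    -- Effective filter when DAYS_TO_SUBTRACT = 1 (column name assumed).
    SELECT COUNT(*)
    FROM GOSALES.ORDER_HEADER
    WHERE DATE(ORDER_DATE) >= CURRENT DATE - 1 DAYS;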

Learning objectives

After you complete the lessons in this module, you will understand the concepts and know how to do the following tasks:

  • Create a database connection
  • Deploy a data warehousing application
  • Run the control flows
  • Create a schedule to run a control flow automatically
  • Monitor a control flow instance and view logs and statistics

Time required

This module should take approximately 30 minutes to complete.

Starting the Administration Console and creating a connection

In this lesson, you will create a database connection in the Administration Console. After you define a database connection, it becomes available so that you can select it when you deploy a data warehousing application.

Overview

The Administration Console is a Web application that you can use to deploy and manage applications, control flows, database resources, and system resources.

For each database that your applications need to connect to, you can create one or more database connections in the Administration Console. For example, if you want to use the same database for both cube modeling and SQW applications, you can use the same database connections for both purposes. However, you can also create separate connections.

Tasks in this lesson

To create a database connection, do these tasks:

 1. Starting the Administration Console
 2. Creating a connection to the GSDB database

Deploying the application to the Administration Console

In this lesson, you deploy the data warehousing application that you created in Design Studio. By deploying a new application, you make the application's control flows available to schedule, run, and manage.

Running and scheduling the control flows

In this lesson, you run the control flows that are in the newly deployed application. To load target tables for your data mart, you also schedule one of the control flows to run regularly.

Overview

An individual control flow in the runtime environment is equivalent to a control flow in the design-time environment. When you schedule or start a control flow, you are running all of the activities within a particular control flow, including its data flows.

Learn more about control flow instances

After you start a control flow, a unique control flow instance is created.

You can monitor the instance to check its execution status. Control flow instances are specific jobs that are created when you launch a control flow or start a schedule from the Administration Console. Each instance has a specific instance ID, either user-defined or system-generated, that you can use to monitor multiple instances of the same control flow when they run.

Each control flow instance might also have its own set of variable values, especially runtime instance values that you must specify when you start or schedule the control flow.

Tasks in this lesson

To run a control flow, do these tasks:

 1. Running the load_branch_data control flow
 2. Running the populate_time_and_order_fact control flow
 3. Scheduling a control flow to update the TIME_DIMENSION and ORDER_FACT tables

Monitoring the control flow instance

In this lesson, you learn how to view the logs and statistics for the control flow instances that you ran in the previous lesson.

Overview

Term: control flow instance

A specific job that is created when you start or schedule control flows from the console. Each instance has a specific name, either user-defined or system-generated, so you can monitor multiple instances of the same control flow when they run.

Learn more about instance states

A control flow instance has the following seven possible states:

Scheduled
The first state for a scheduled control flow instance. The instance remains in this state until its scheduled time arrives, when it moves into the Running state.
Running
The first state for any unscheduled control flow instance, regardless of how the instance started. If you start a control flow immediately, or restart a failed or stopped instance, the instance enters the Running state. For a scheduled control flow instance, this is the state a control flow moves into when its scheduled time arrives.
Successful
The final state that a control flow instance moves into when all of its activities run to completion.
Warnings
The final state that a control flow instance moves into when the instance runs to completion and an activity failure is ignored because that activity is set to ignore failures in the control flow.
Failed
The potentially final state that a control flow instance moves into when the instance fails at some point in the process flow. Failed control flow instances can be restarted. When you restart an instance, it moves from the Failed state to Running. Instances restart from the beginning of the process, not from the point of failure.
Stopping
The interim state that a control flow instance moves into when you stop it from the console. When you stop an instance, it remains in the Stopping state until the currently running activity finishes. If a failure occurs before a stop request can be processed, the instance moves to the Failed state, not the Stopped state.
Stopped
The potentially final state that a control flow instance moves into when you stop it from the console. Stopped control flow instances can be restarted. When you restart an instance, it moves from the Stopped state to Running. Instances restart from the beginning of the process, not from the point where they were stopped.

Tasks in this lesson

To monitor control flow instances, do these tasks:

 1. Viewing the log
 2. Viewing the statistics of a control flow and its activities
 3. Viewing the statistics for an instance of a control flow and its activities
 