First of all, if you want to really learn how work with SQL, you can learn it on sql-ex.ru for free.
So far, you’ve seen diagrams linking dimension and fact tables together, but in a database these are just tables. The relations in the diagrams are foreign key constraints, which means that it will not be possible to insert a product key into the orders fact table if this key does not exist in the dim_product dimension table. Getting data out of the database is another thing. For this, you need a query language called Structured Query Language, or SQL. If you’re not familiar with SQL, don’t worry—the SQL you need for retrieving information from a star schema is not very complicated. And although this book is not a SQL text, we cover enough of the basics here to enable you to start writing queries to retrieve data from dimensional data marts. All we need for this is the SELECT statement, a way of joining the different tables together, and an understanding of how to group the results from the query. For starters, have a look at the basic SQL building blocks for selecting data:
If only the total revenue is needed from the fact table, there’s no need to use
JOINs or GROUP BY clauses, just the following:
SELECT SUM(revenue) FROM fct_orders
This little query doesn’t tell you very much, so you can add the date dimension to calculate the revenue by year:
SELECT date_year, SUM(revenue)
FROM fct_orders
JOIN dim_date
ON fct_orders.date_key = dim_date.date_key GROUP BY date_year
Now you have a simple query that is the basis for all other possible queries that can be executed against a dimensional data model. The first line with the SELECT clause tells the database what the result set will look like. Next, you need to tell where the data is retrieved from by using the FROM clause.
The SQL examples used so far still lack the use of aliases. An alias is another name you can use to refer to a table, column, or expression. Although they are not always required, it’s good practice to always use them. Adding the alias total_revenue to the expression SUM(revenue) gives more meaning to the result column, and using aliases for table names enables a shorthand notation of the join conditions. Compare the following JOIN clauses to understand what we mean by that:
FROM fct_orders
JOIN dim_date ON fct_orders.date_key = dim_date.date_key
versus
There are cases in which the use of aliases is not optional but required. The
first practical case where aliases are at least very helpful is when the same
column name appears in more than one table of the set used in the query.
Suppose that in the previous example you wanted to include date_key in the
result set. Because this column is part of both the fact and the dimension table,
the SQL parser cannot determine which table to pick the column from, so you need to include the table reference in your SELECT statement—for example,
SELECT dim_date.datekey. Now it’s immediately obvious that the use of a
short alias such as d will save a lot of typing. The complete query statement,
including the use of aliases, now looks as follows:
The WHERE part of this query is covered in more detail shortly.
So far, you’ve seen diagrams linking dimension and fact tables together, but in a database these are just tables. The relations in the diagrams are foreign key constraints, which means that it will not be possible to insert a product key into the orders fact table if this key does not exist in the dim_product dimension table. Getting data out of the database is another thing. For this, you need a query language called Structured Query Language, or SQL. If you’re not familiar with SQL, don’t worry—the SQL you need for retrieving information from a star schema is not very complicated. And although this book is not a SQL text, we cover enough of the basics here to enable you to start writing queries to retrieve data from dimensional data marts. All we need for this is the SELECT statement, a way of joining the different tables together, and an understanding of how to group the results from the query. For starters, have a look at the basic SQL building blocks for selecting data:
- SELECT—A list of the columns, constants, and expressions you want to retrieve from the database.
- FROM—The tables and views that contain the data you need, including the relationships between those tables and views.
- WHERE—The restrictions you want to apply to the data in the selected tables, views, and expressions, excluding the aggregations.
- GROUP BY—Specifies the summary level of the query. All non-aggregate columns and expressions in the SELECT list should be part of the GROUP BY statement.
- HAVING—Contains conditions on summarized expressions, e.g., having sum (revenue) > 100,000.
- ORDER BY—Denotes the order in which the results will be displayed.
If only the total revenue is needed from the fact table, there’s no need to use
JOINs or GROUP BY clauses, just the following:
SELECT SUM(revenue) FROM fct_orders
This little query doesn’t tell you very much, so you can add the date dimension to calculate the revenue by year:
SELECT date_year, SUM(revenue)
FROM fct_orders
JOIN dim_date
ON fct_orders.date_key = dim_date.date_key GROUP BY date_year
Now you have a simple query that is the basis for all other possible queries that can be executed against a dimensional data model. The first line with the SELECT clause tells the database what the result set will look like. Next, you need to tell where the data is retrieved from by using the FROM clause.
You also want to JOIN another table and you need to tell the database which
columns the join will be based on—in this case, the date_key in the dimension
and fact tables. Finally, the results are grouped by all elements of the SELECT
statement that are not part of an aggregate function. For example, when the
SELECT statement looks like this:
the GROUP BY statement needs to be
GROUP BY date_year, product_type, product_name
To complete this query all you need to add are the FROM and JOIN parts:
SELECT date_year, product_type, product_name, SUM(revenue), SUM(quantity)
the GROUP BY statement needs to be
GROUP BY date_year, product_type, product_name
To complete this query all you need to add are the FROM and JOIN parts:
The SQL examples used so far still lack the use of aliases. An alias is another name you can use to refer to a table, column, or expression. Although they are not always required, it’s good practice to always use them. Adding the alias total_revenue to the expression SUM(revenue) gives more meaning to the result column, and using aliases for table names enables a shorthand notation of the join conditions. Compare the following JOIN clauses to understand what we mean by that:
FROM fct_orders
JOIN dim_date ON fct_orders.date_key = dim_date.date_key
versus
FROM fct_orders AS f
JOIN dim_date AS d ON f.date_key = d.date_key