Unlocking Data Modeling Success: 3 Must-Have Contextual Tables

And how to ingest valuable data for free

Matthew Gazzano
Towards Data Science


Photo by Tobias Fischer on Unsplash

Data modeling can be a challenging task for analytics teams. With unique business entities in every organization, finding the right structure and granularity for each table becomes an open-ended problem. But fear not! Some of the data you need is simple, free, and occupies minimal storage.

When your data is modeled in full, you can see the following benefits:

  • Queries are less complex to generate, and therefore more readable.
  • Reports are more scalable, reducing hard-coded values.
  • You spend less time tracking down where the right data lives.

Below are 3 generic tables that can streamline your team’s analytics and that you can ingest into your Data Warehouse as part of a dimensional model.

🗓️The Date Dimension

For Timeseries Reporting

If you have ever needed to show a business metric as it was at a given point in time, this is a nearly essential table to have. For example, you may be asked:

  • “What did sales look like in FY23?”
  • “Can you show me client churn on a daily basis?”

Management frequently seeks insights from a timeseries perspective, asking questions like “How is x growing or shrinking over time?”. A date dimension enables flexible analysis of various metrics based on different date attributes.

Most Date Dimension tables can be created solely using DDL statements directly in your Data Warehouse, with a mix of date functions.

In the below example, I use BigQuery SQL to do just that:

CREATE OR REPLACE TABLE `your_project.your_dataset.date_dimension` AS

SELECT
  full_date
  , EXTRACT(MONTH FROM full_date) AS calendar_month_number
  , EXTRACT(YEAR FROM full_date) AS calendar_year
  , EXTRACT(QUARTER FROM full_date) AS calendar_quarter
  , FORMAT_DATE('%B', full_date) AS calendar_month_name
  , EXTRACT(DAYOFWEEK FROM full_date) AS day_of_week_number -- 1 = Sunday ... 7 = Saturday
  , FORMAT_DATE('%A', full_date) AS day_name
  , CASE
      WHEN EXTRACT(DAYOFWEEK FROM full_date) BETWEEN 2 AND 6 THEN TRUE
      ELSE FALSE
    END AS day_is_weekday
  -- Most recent weekday (Mon-Fri) prior to each date
  , CASE
      WHEN EXTRACT(DAYOFWEEK FROM full_date) = 1 THEN DATE_SUB(full_date, INTERVAL 2 DAY) -- Sunday
      WHEN EXTRACT(DAYOFWEEK FROM full_date) = 2 THEN DATE_SUB(full_date, INTERVAL 3 DAY) -- Monday
      ELSE DATE_SUB(full_date, INTERVAL 1 DAY)
    END AS last_weekday
  -- Fiscal calendar starting in July: shift each date forward by 6 months
  , EXTRACT(MONTH FROM DATE_ADD(full_date, INTERVAL 6 MONTH)) AS fiscal_month
  , EXTRACT(YEAR FROM DATE_ADD(full_date, INTERVAL 6 MONTH)) AS fiscal_year
  , EXTRACT(QUARTER FROM DATE_ADD(full_date, INTERVAL 6 MONTH)) AS fiscal_quarter
FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2050-12-31', INTERVAL 1 DAY)) AS full_date

Breaking this down:

  • We start with the GENERATE_DATE_ARRAY function, which returns an array of dates in a range that you specify. Because arrays in BigQuery hold multiple values in a single row, we then use the UNNEST function to break each element of the array out into its own row, just like in a standard database table.
  • Then, with the full_date column generated from the unnested array (which holds dates in YYYY-MM-DD format), we can use BigQuery’s EXTRACT function to pull individual parts of full_date into separate fields (month, quarter, year, etc.).
  • The FORMAT_DATE function has a similar purpose to EXTRACT, but gives us more control over how the date value appears. The special format elements (the ‘%’ codes) are described in Google’s documentation.
  • We also use the DATE_SUB function, which subtracts an interval from a date; here it derives last_weekday, the most recent weekday prior to each date. Its counterpart, DATE_ADD, adds an interval: shifting each date forward by 6 months yields the fiscal month, quarter, and year for a fiscal calendar that starts in July. Both functions take a quantity and an interval unit (DAY, MONTH, YEAR, etc.).
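
If your warehouse is not BigQuery, or you prefer to build the table in your ingestion layer, the same logic can be sketched in pandas. This is a minimal, illustrative equivalent, assuming the same July fiscal year start as the SQL above:

import pandas as pd

# One row per day over an illustrative range
date_dim = pd.DataFrame({'full_date': pd.date_range('2020-01-01', '2050-12-31', freq='D')})

date_dim['calendar_month_number'] = date_dim['full_date'].dt.month
date_dim['calendar_year'] = date_dim['full_date'].dt.year
date_dim['calendar_quarter'] = date_dim['full_date'].dt.quarter
date_dim['calendar_month_name'] = date_dim['full_date'].dt.month_name()
date_dim['day_name'] = date_dim['full_date'].dt.day_name()
date_dim['day_is_weekday'] = date_dim['full_date'].dt.dayofweek < 5  # Monday=0 ... Sunday=6

# Fiscal calendar starting in July: shift each date forward by 6 months
shifted = date_dim['full_date'] + pd.DateOffset(months=6)
date_dim['fiscal_month'] = shifted.dt.month
date_dim['fiscal_year'] = shifted.dt.year
date_dim['fiscal_quarter'] = shifted.dt.quarter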

🌎The Zip Code Dimension

For Geospatial Reporting

If you have been tasked with creating a heatmap visualization or a general geospatial analysis, a zip code dimension will be of great use to your team. It gives you the option to plot elements by latitude and longitude, aggregate by county name or time zone, and append population data for benchmarking.

A Zip Code Dimension is a great supplement to a customer table. Using the zip code field as the join key, you can append meaningful context about where your customer base resides and surface the geographic patterns behind it.
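
As a rough sketch of that join in pandas (the customer records and the slice of the dimension shown here are purely hypothetical), enriching customers with geographic context might look like this:

import pandas as pd

# Hypothetical customer records keyed by zip code
customers = pd.DataFrame({
    'customer_id': [101, 102],
    'zip_code': ['10001', '60614'],
})

# Hypothetical slice of the zip code dimension
zip_dim = pd.DataFrame({
    'zip_code': ['10001', '60614'],
    'county_name': ['New York', 'Cook'],
    'latitude': [40.75, 41.92],
    'longitude': [-73.99, -87.65],
})

# Left join keeps every customer and appends their geographic context
customers_enriched = customers.merge(zip_dim, on='zip_code', how='left')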

Opendatasoft provides various free datasets through an open API. One that is perfect for this use case is the US Zip Codes Points - United States of America dataset. On that page, head to the ‘API’ tab, which lets you configure the URL used to retrieve the data as JSON.

With a few lines in Python, we can output the following Pandas DataFrame:

import requests
import pandas as pd

# URL generated on the 'API' tab of the Opendatasoft dataset page
url = 'https://data.opendatasoft.com/api/records/1.0/search/?dataset=georef-united-states-of-america-zc-point%40public&q=&facet=stusps_code&facet=ste_name&facet=coty_name&facet=cty_code&facet=zip'
response = requests.get(url)
zips = response.json()

# Flatten the 'records' array of the JSON response into a DataFrame
df_zips = pd.json_normalize(zips, record_path='records')
df_zips

Breaking this down:

  • Here I use the requests library to retrieve the data from the URL stored in the url variable, which was generated on the ‘API’ tab of the Opendatasoft website.
  • With Pandas, I use the json_normalize function to flatten the JSON response into a DataFrame; a light clean-up sketch follows below.
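
The flattened DataFrame keeps the API’s nested structure in its column names (attributes nested under each record’s fields key come out as fields.* columns). As a hedged clean-up sketch, you might keep only the columns you need and rename them; the field names below are taken from the facets in the URL above, so verify them against df_zips.columns for your actual response:

# Column names assume the v1 records API, which nests attributes under 'fields.*';
# check df_zips.columns if your response differs.
zip_dim = df_zips.rename(columns={
    'fields.zip': 'zip_code',
    'fields.stusps_code': 'state_code',
    'fields.ste_name': 'state_name',
    'fields.coty_name': 'county_name',
})[['zip_code', 'state_code', 'state_name', 'county_name']]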

📈The FX Rates Fact Table

For Financial Analysis

Organizations with international clients often need to convert all transactions to a base currency for financial reporting. A daily FX rate feed is a great way to understand how foreign exchange fluctuations may be impacting revenue.

This is also particularly impactful when paired with time series reporting, to append exchange rates at the time of a given sale. In my experience building dashboards which show client revenue over a period of time, business users always appreciated the ability to toggle different FX rate values against revenue on a date axis. The FX Rates table lets you accomplish all of this.
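
To make that concrete, here is a minimal sketch of converting transaction amounts into USD by joining on currency and date. The transactions table and the rates shown are hypothetical; the column names match the currency, value, and cycle_date fields built later in this section, and because the feed is quoted against a USD base, dividing by the rate converts a foreign amount into USD:

import pandas as pd

# Hypothetical transactions recorded in local currencies
transactions = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2024-01-15', '2024-01-15']),
    'currency': ['EUR', 'GBP'],
    'amount': [1000.0, 500.0],
})

# Hypothetical slice of the FX rates fact table (units of foreign currency per 1 USD)
fx_rates = pd.DataFrame({
    'cycle_date': pd.to_datetime(['2024-01-15', '2024-01-15']),
    'currency': ['EUR', 'GBP'],
    'value': [0.91, 0.79],
})

# Join on currency and date (rename cycle_date so both frames share key names)
converted = transactions.merge(
    fx_rates.rename(columns={'cycle_date': 'transaction_date'}),
    on=['transaction_date', 'currency'],
    how='left',
)

# With a USD base, dividing by the rate converts the local amount into USD
converted['amount_usd'] = converted['amount'] / converted['value']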

Exchangerate.host is another open-source site that provides a daily FX rates feed. Below is an example of how to retrieve the data; more info can be found in their documentation:

import requests
import pandas as pd
from datetime import date

# Retrieve the latest rates from the exchangerate.host API, quoted against a USD base
url = 'https://api.exchangerate.host/latest?base=USD'
response = requests.get(url)
rates = response.json()

# Convert the 'rates' dictionary to a Pandas DataFrame & preview
rates_list = list(rates['rates'].items())
df_rates = pd.DataFrame(rates_list, columns=['currency', 'value'])
df_rates['cycle_date'] = date.today()
df_rates.head(10)

Breaking this down:

  • Here we use the requests library to retrieve the data from the URL specified in the url variable as a Python dictionary. Please note that I edited the URL per the documentation to set the base currency to USD, meaning every rate is quoted relative to USD.
  • Next, we take the dictionary stored under the ‘rates’ key and convert its items into a list of (currency, value) pairs.
rates_list = list(rates['rates'].items())
  • Then, we convert the list into a Pandas DataFrame and label the column headers:
df_rates = pd.DataFrame(rates_list, columns=['currency', 'value'])
  • Finally, a ‘cycle_date’ column is added to represent the ETL cycle date, denoting when the data is ingested into the Data Warehouse; a sketch of that load step follows below.
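
How you land the DataFrame in your warehouse depends on your stack. As one minimal sketch using the pandas-gbq package against BigQuery (the project, dataset, and table names are placeholders), the daily append might look like:

import pandas_gbq

# Append today's rates to the FX rates fact table; names are placeholders
pandas_gbq.to_gbq(
    df_rates,
    destination_table='your_dataset.fx_rates_fact',
    project_id='your_project',
    if_exists='append',
)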

Conclusion

Incorporating publicly available data into your Data Warehouse can provide immediate value to analytics teams with minimal effort. These tables, like any properly modeled data entity, eliminate the need to bury nested business logic solely in BI tools like Power BI or Tableau. Instead, they provide a centralized source of data that multiple analysts can reference and apply consistently in their reporting. This cohesive approach to data modeling lets teams scale reporting while keeping the source data transparent. By leveraging these kinds of contextual tables, your organization can streamline analytics processes, eliminate discrepancies in reporting, and reach a higher level of data-driven decision-making.

Happy Modeling!
