In this guide, we will walk through how to setup dbt Core in the cloud with Redshift. After you finish this guide, you will have the sample data provided uploaded to Redshift and run your first dbt command in the cloud.
Although the steps in this guide will specifically utilize Redshift, the steps can be modified slightly to work with any database that dbt supports. We also have guides made specifically for Bigquery, Databricks, and Snowflake.
If you'd rather watch a video version of all three chapters of this guide, head over to YouTube. Let's jump right in!
Before getting into the steps of setting up the different cloud data warehouses, please download the sample files that we will use for this tutorial here.
Load data into S3
- FNavigate into Redshift and use the navigation menu to go to the query editor.
- Navigate into the bucket you created for this tutorial by clicking on its name.
- Create a folder inside of your bucket named
fivethirtyeight_football
by clicking the Create Folder button. - Once the folder is created, navigate inside of it.
- Click the Upload button to begin the process of uploading our sample files.
- Click Add Files.
- Select the two files from your file system and click open. After you do that, your page should look like this:
- Click Upload. After the upload is complete, you should be shown an upload succeeded banner that looks like this:
Create Tables in Redshift
- Navigate into Redshift and to the query editor.
- Create new schema for our sample data named
soccer
by running this query:
create schema if not exists soccer
- Create the tables inside of our new soccer schema to hold our uploaded data in S3. This query will accomplish that:
create table soccer.stg_football_rankings(
rank integer,
prev_rank integer,
name varchar(255),
league varchar(255),
offense float,
def float,
spi float
);
create table soccer.stg_football_matches(
season integer,
date date,
league_id integer,
league varchar(255),
team1 varchar(255),
team2 varchar(255),
spi1 float,
spi2 float,
prob1 float,
prob2 float,
probtie float,
proj_score1 float,
proj_score2 float,
importance1 float,
importance2 float,
score1 integer,
score2 integer,
xg1 float,
xg2 float,
nsxg1 float,
nsxg2 float,
adj_score1 float,
adj_score2 float
);
Now that we have our tables setup in Redshift. We need to load the data from S3 into the tables.
Load data from S3 into Redshift Tables
- Navigate to S3 and find the files that we uploaded in the prior steps.
- Click the name of each table to locate the S3 URI.
- Copy and paste the S3 URIs to a notepad for use later in these steps.
- Navigate back to the Redshift console.
- Run the following two queries replacing S3 URI, IAM_role, and region with the values that are specific to you:
copy soccer.stg_football_matches( season, date, league_id, league, team1, team2, spi1, spi2, prob1, prob2, probtie, proj_score1, proj_score2, importance1, importance2, score1, score2, xg1, xg2, nsxg1, nsxg2, adj_score1, adj_score2)
from 'S3 URI'
iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
region 'us-east-1'
delimiter ','
ignoreheader 1
acceptinvchars;
copy soccer.stg_football_rankings( rank, prev_rank, name, league, offense, def, spi)
from 'S3 URI'
iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
region 'us-east-1'
delimiter ','
ignoreheader 1
acceptinvchars;
You should now be able to query soccer.stg_football_rankings
and soccer.stg_football_matches
. Feel free to run this query to verify that this process worked successfully:
select * from soccer.stg_football_matches
dbt Core Part 2 - Setting Up dbt on Github
Fork dbt Setup from GitHub
- Fork this repository. The repository contains the beginning state of a dbt project.
- Clone the repository locally on your computer.
- Open
dbt_project.yml
in your text editor.
dbt Project File Setup
- Change the project name to
soccer_538
. - Change the profile to
soccer_538
. - Change model name to
soccer_538
. - Under the soccer_538 model, add a
staging
andmarts
folder that are both materialized as views. - Save your changes.
Profile Setup
- Open
profiles.yml
. - Update the file to this:
soccer_538:
target: dev
outputs:
dev:
type: redshift
host: hostname.region.redshift.amazonaws.com
user: "{{ env_var('redshift_username') }}"
password: "{{ env_var('redshift_password') }}"
port: 5439
dbname: analytics
schema: soccer
threads: 4
keepalives_idle: 240 # default 240 seconds
connect_timeout: 10 # default 10 seconds
ra3_node: true
```
3. Create a new file in your root directory of your dbt project called `execute_dbt.py`.
4. Paste this code block for the content of `execute_dbt.py`:
```python
import subprocess
import os
import json
dbt_command = os.environ.get('dbt_command', 'dbt run')
subprocess.run(['sh', '-c', dbt_command], check=True)
- Commit and push your changes to Github.
Now that we have our sample data and dbt processes setup, we need to write our example models for the dbt job to run.
dbt Models
- Navigate into the models folder in your text editor. There should be a subfolder under models called
example
. Delete that subfolder and create a new folder called538_football
. - Create two subfolders inside
538_football
calledstaging
andmarts
.
- Inside the staging folder, create a file called stg_football_matches.sql.
- Paste the following code into that file:
select * fromsoccer.stg_football_matches
- Inside the staging folder, create a file called
stg_football_rankings.sql
- Paste the following code into that file:
select * fromsoccer.stg_football_rankings
- In the staging folder, add a file called
schema.yml
. - In this file, paste the following information:
version: 2
models:
- name: stg_football_matches
description: Table from 538 that displays football matches and predictions about each match.
- name: stg_football_rankings
description: Table from 538 that displays a teams ranking worldwide
- In the marts folder, create a file called
mart_football_information.sql
. - Paste the following code into that file:
with
qryMatches as (
SELECT * FROM {{ ref('stg_football_matches') }} where league = 'Barclays Premier League'
),
qryRankings as (
SELECT * FROM {{ ref('stg_football_rankings') }} where league = 'Barclays Premier League'
),
qryFinal as (
select
qryMatches.season,
qryMatches.date,
qryMatches.league,
qryMatches.team1,
qryMatches.team2,
team_one.rank as team1_rank,
team_two.rank as team2_rank
from
qryMatches join
qryRankings as team_one on
(qryMatches.team1 = team_one.name) join
qryRankings as team_two on
(qryMatches.team2 = team_two.name)
)
select * from qryFinal
- In the marts folder, add a file called
schema.yml
- In this file, paste the following:
version: 2
models:
- name: mart_football_information
description: Table that displays football matches along with each team's world ranking.
- Save the changes.
- Push a commit to Github.
We are ready to move into Shipyard to run our process. First, you will need to create a developer account.
dbt Core Part 3 - Setting Up dbt on Shipyard
Create Developer Shipyard Account
- Navigate to Shipyard's sign-up page here.
- Sign up with your email address and organization name.
- Connect to your Github account by following this guide. After connecting your Github account, you'll be ready to create your first Blueprint.
Creating dbt Core Blueprint
- On the sidebar of Shipyard's website, click Blueprints.
- Click Add Blueprint on the top right of your page.
- Select Python.
- Under Blueprint variables, click Add Variable.
- Under display name, enter
dbt CLI Command
. - Under reference name, enter
dbt_command
. - Under default value, enter dbt run.
- Click the check box for required
- Under placeholder, enter
Enter the command for dbt
. - Click Next
- Click Git.
- Select the repository where your dbt files sit.
- Click the source that you want the files pulled from. Generally main or master.
- Under file to run, enter
execute_dbt.py
. - Under Git Clone Location, select the option for Unpack into Current Working Directory.
- Click Next Step on the bottom right of the screen.
- Next to Environment Variable, click the plus sign to add an environment variable.
Add Environment Variables
The environment variables that need to be added will vary based on the cloud database that you use.
Variable Name | Value |
---|---|
DBT_PROFILES_DIR | . |
redshift_username | username from redshift |
redshift_password | password from redshift |
Python Packages
- Click the plus sign next to Python Packages.
- In the Name field, enter dbt-redshift. In the version field, enter
==1.0.0
. - Click Next.
Blueprint Settings
- Under Blueprint Name, enter
dbt - Execute CLI Command
. - Under synopsis, enter
This Blueprint runs a dbt core command.
- Click Save.
- In the top right of your screen, click Use this Blueprint. This will take you over to the Fleet Builder and prompt you to select a project.
Build dbt Core Fleet
- On the Select a Project prompt, click the drop down menu to expand it and select
Create a New Project
. - Under project name, enter
dbt Core Testing
. - Under timezone, enter your timezone.
- Click Create Project.
- Select dbt Core Testing and click Select Project. This will create a new Fleet in the project. The Fleet Builder will now visible with one Vessel located inside of the Fleet.
- Click on the Vessel in the Fleet Builder and you will see the settings for the Vessel pop up on the left of your screen.
- Under Vessel Name, enter
dbt Core CLI Command
. - Under dbt CLI Command, enter
dbt debug
. - Click the gear on the sidebar to open Fleet Settings.
- Under Fleet Name, enter
dbt Core
. - Click Save & Finish on the bottom right of your screen.
- This should take you to a page showing that your Fleet was created successfully.
- Click Run Your Fleet. This will take you over to the Fleet Log.
- You can click on the bar to get the output from your run.
If you scroll to the top of the output, you will see that the environment variables that were put in during the Blueprint creation process are hidden from the user.
If dbt debug succeeds, we are ready to move into part three of the guide. If it fails, please go back to the steps above and make sure everything is setup correctly. Feel free to send an Intercom message to us at anytime using the widget on the bottom right of the Shipyard screen.
In the meantime, please consider subscribing to our weekly newsletter, "All Hands on Data." You'll get insights, POVs, and inside knowledge piped directly into your inbox. See you there!