Are you running dbt Core locally on your machine and want to take your process to the cloud? Maybe you want to automate your dbt processes to run after an event or on a schedule. In this blog post, we'll walk through the steps to get your dbt processes automated.
This guide assumes that your dbt files are stored in a GitHub repository. While the steps in this guide can be followed with a few variations if the files are on your local machine, we recommend that you put your files into a GitHub repository before continuing.
Hosting dbt in the Cloud
Taking dbt from a local machine to the cloud can be quite a complicated process. Christophe Blefari outlined all of the ways to do that in a recent blog post. The majority of his options involve hosting your own server or running your dbt process through a cloud service provider such as AWS or GCP. In this guide, we will use Shipyard, where you don't have to worry about hosting or managing a server.
Upload Python Script for dbt Commands
To allow you to send commands to dbt through Shipyard, you need to add a Python script to your GitHub repository. Download the file that corresponds to the database where you store your data and rename it to execute_dbt.py. Then, upload it to the root directory of your GitHub repository.
We currently have Python scripts available for BigQuery, Databricks, Redshift, and Snowflake. If you are running dbt against any other database, feel free to reach out to us, and we can help you create a Python script.
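To give you a sense of what a script like this does, here's a minimal sketch. It assumes the script simply reads the DBT_COMMAND environment variable (which we'll set on the Vessel later in this guide) and shells out to the dbt CLI; the actual downloadable scripts may differ in the details:

```python
import os
import shlex
import subprocess
import sys

# Read the dbt command from the DBT_COMMAND environment variable,
# which we'll configure on the Vessel later in this guide.
dbt_command = os.environ.get("DBT_COMMAND", "dbt debug")

# Run the command. dbt reads DBT_PROFILES_DIR to locate profiles.yml,
# so no extra arguments are needed here.
result = subprocess.run(shlex.split(dbt_command))

# Propagate dbt's exit code so Shipyard marks the Vessel as errored
# whenever a dbt step fails.
sys.exit(result.returncode)
```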
Add your Base Profile
For your dbt process to run in Shipyard, you will need to have your profiles.yml file in the root directory of your GitHub repository. This file tells dbt how to connect to your database. If you don't have a profiles.yml file for your dbt process yet, head over to the dbt docs to create one for your specific database.
The profiles.yml file contains sensitive information about how to connect to your database. To avoid committing these credentials to GitHub, you can reference environment variables so the actual values never appear in your repository.
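To make that concrete, here is a minimal sketch of a profiles.yml that pulls its credentials from environment variables using dbt's built-in env_var() function. The example assumes Snowflake, and the profile name, variable names, and connection details are placeholders you'd swap for your own:

```yaml
# profiles.yml (sketch): the top-level key must match the
# "profile" name set in your dbt_project.yml
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      # Credentials come from environment variables, so no secrets
      # are ever committed to the repository.
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: transforming
      schema: dbt_prod
      threads: 4
```

You would then add SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD as environment variables on your Vessel, alongside the two variables we set in a later section.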
Connect GitHub to Shipyard
Head on over to the Shipyard website and sign in. Once you're logged in, use the sidebar to navigate to the "Admin" tab. From there, select "Integrations" from the drop-down menu.
Now it's time to connect your Shipyard account to GitHub. Click the "GitHub" button and then "Add Connection." You'll be taken to a page on GitHub where you'll need to sign in and choose the organization where your dbt models are located. Next, you'll have the option to grant Shipyard access to all of your organization's repositories or just the one that contains your dbt models. Make your selection and then click "Install."
This will redirect you back to Shipyard's "Admin" page, where you should now see your GitHub connection displayed on the right side of the page.
Create a Vessel to Run dbt
To create a new Fleet in the Fleet Builder, click "New Fleet" on the top left of your screen. Then, select the project that you want your dbt job to live in and click "Select Project". In the code Vessels section, click "Python" to create a Vessel. Give the Vessel a name, such as "Run dbt CLI Command," and under "File to Run," enter "execute_dbt.py". Click "Git" and choose the repository where your dbt files are located under "Repo." Finally, select "main" under "Code Source."
Add Environment Variables
Below the code section for the Vessel, you should see a section called Environment Variables. Click it to expand the menu and add the variables below:
| Variable Name | Value |
| --- | --- |
| DBT_COMMAND | dbt debug |
| DBT_PROFILES_DIR | . |
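If you want to sanity-check this wiring before running anything in Shipyard, the same two variables drive the script locally. Assuming dbt and your adapter package are installed, something like this should work from the root of your repository:

```bash
DBT_COMMAND="dbt debug" DBT_PROFILES_DIR=. python execute_dbt.py
```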
The dbt debug command only verifies that dbt can connect to your database; once it succeeds, you can point DBT_COMMAND at any other dbt command, such as dbt run. With your environment variables set, expand the menu for Python Packages just below.
Python Packages
The Python package that you need to install differs based on the database you are using. Click the plus sign next to Python Packages and add the package listed for your database below:
| Database | Package Name | Version |
| --- | --- | --- |
| BigQuery | dbt-bigquery | ==1.3.0 |
| Databricks | dbt-databricks | ==1.3.2 |
| Redshift | dbt-redshift | ==1.3.0 |
| Snowflake | dbt-snowflake | ==1.3.0 |
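If you'd rather test the connection on your own machine first, the same pinned adapter can be installed with pip. For example, for Snowflake:

```bash
pip install dbt-snowflake==1.3.0
```

Installing the adapter also pulls in dbt-core itself, so there's nothing else to add.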
Fleet Settings
Click the gear icon on the left side of your screen. This will bring up the Fleet Settings menu. From here, you can enter a name for your Fleet under the "Fleet Name" field. For example, you might choose "dbt Core CLI Command." Once you've entered a name, you're all set to run your Fleet.
Click "Save & Finish" on the bottom right of your screen. This will redirect you to a page confirming that your Fleet was created successfully. Click "Run Now" to start your Fleet run. You'll be taken to the Fleet Log page, where you can watch your Fleet run in real time.
Once the Fleet has finished running successfully, you have connected dbt to your database in the cloud with Shipyard. You could repeat these steps for every dbt process you have, but setting up each Vessel by hand quickly becomes tedious. Instead, we recommend building a dbt Core Blueprint that lets you replicate this setup and create new dbt Core Vessels with a single click.
Next Steps
If you're using dbt in your data infrastructure, you know how powerful it can be for transforming and modeling your data. But managing dbt jobs and processes can sometimes be a time-consuming task. That's why we recommend taking a few steps to streamline your workflow.
One idea is to create a dbt Core Blueprint that outlines the steps of your dbt process, making it easier to replicate in the future.
Another tip is to set up a schedule for your dbt jobs so that they run automatically at predetermined times.
And for even more automation, consider setting up a webhook trigger that will allow you to kick off dbt jobs in response to certain events.
These strategies can save you time and effort in the long run, leaving you with more energy to focus on analyzing and using your data.