Building a Blueprint with Great Expectations
Overview
In this tutorial, you'll walk through the steps required to set up Great Expectations to run in the cloud, on Shipyard. We will be creating a Blueprint that can be re-used by multiple team members and updated in the background. This tutorial is only in Python.
By the end of the tutorial, you'll be able to:
- Set up a Blueprint using Python
- Successfully run Great Expectations on Shipyard
- Share expectations with your organization
- Run multiple instances of Great Expectations simultaneously
- Integrate an Expectation Suite into your Fleets
For more information, read our blog post that covers Getting Started with Great Expectations. You can also visit www.greatexpectations.io for additional information.
Setup
For this tutorial, we will be building a Fleet inside of a Project called "Default". To begin, log into Shipyard and click New Fleet on the sidebar.
Download the following file to your computer without changing the file name. It's a .zip containing a single Python file and a Great Expectations directory structure with JSON expectation suites and a YML setup file. We'll use this throughout the tutorial.
Feel free to peruse this script beforehand so you understand everything that it's doing. The main script is accomplishing the following things:
- Downloading a file from a public URL.
- Decompressing the file if it is a `.gz` file and converting the file into a CSV if it is not one already.
- Running Great Expectations against the downloaded file, using the included sample expectation suites.
- Uploading the validation output to S3, using a file name structure that references Shipyard's Platform Environment Variables.
- Printing the validation results to the standard output.
- Returning the appropriate exit code based on expectation results.
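The file-preparation and exit-code steps listed above can be sketched roughly as follows. This is a minimal illustration and not the contents of the bundled script: the function names `prepare_csv` and `exit_code_for` are hypothetical, non-CSV input is assumed to be tab-separated, and the validation result is assumed to be a dictionary with a top-level `success` flag.

```python
import csv
import gzip
import shutil
from pathlib import Path


def prepare_csv(path: str) -> str:
    """Decompress a .gz file if needed, then convert a non-CSV file to CSV."""
    source = Path(path)

    # Step 1: decompress "name.tsv.gz" into "name.tsv".
    if source.suffix == ".gz":
        unzipped = source.with_suffix("")
        with gzip.open(source, "rb") as f_in, open(unzipped, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        source = unzipped

    # Step 2: rewrite tab-separated data as comma-separated (assumed input format).
    if source.suffix != ".csv":
        target = source.with_suffix(".csv")
        with open(source, newline="", encoding="utf-8") as f_in, \
                open(target, "w", newline="", encoding="utf-8") as f_out:
            reader = csv.reader(f_in, delimiter="\t", quoting=csv.QUOTE_NONE)
            csv.writer(f_out).writerows(reader)
        source = target

    return str(source)


def exit_code_for(validation_result: dict) -> int:
    """Return 0 when every expectation passed, 1 otherwise."""
    return 0 if validation_result.get("success") else 1
```

A script built this way would typically finish with `sys.exit(exit_code_for(result))`, so that a failed validation is reported as a failed run.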
Steps
- Click Blueprints on the side navigation bar.
- Click the Add Blueprint button in the top right.
Step 1 - Select A Language
Click on Python. You'll be immediately redirected to the next step.
Step 2 - Create Blueprint Variables
Click the + icon to create a new Blueprint variable. You should see a screen that looks like this:
Our code for Great Expectations has 3 variables that we expect to receive. For a detailed overview of each of these fields, read more about Blueprint Variables.
File URL
- Set the Display Name to `File URL`.
- Set the Reference Name to `input_url`.
- Leave the Variable Type set to Alphanumeric.
- Leave the Default Value empty.
- Check the box for Required?
- Set the Placeholder to `https://s3.region.amazonaws.com/bucket-name/key-name.csv`.
- Set the Tooltip to `URL to download the file from. Must be publicly accessible.`
- Click Add Variable.
Bucket Name
- Set the Display Name to `Bucket Name`.
- Set the Reference Name to `output_bucket_name`.
- Leave the Variable Type set to Alphanumeric.
- Set the Default Value to the bucket name you set up during the setup phase.
- Leave the Required field alone.
- Leave the Placeholder empty.
- Set the Tooltip to `Bucket Name to store the validation JSON files.`
- Click Add Variable.
Expectation Suite
- Set the Display Name to `Expectation Suite`.
- Set the Reference Name to `expectation_suite`.
- Change the Variable Type to Select.
- Under the new section of Selection Options, click the + button twice.
  - Set the first Display Name box to `Amazon Reviews` and set the Internal Value to `amazon-product-reviews`.
  - Set the second Display Name box to `Sample` and set the Internal Value to `sample-suite`.
- Set the Default Value to `Amazon Reviews`.
- Leave the Required? field alone.
- Leave the Placeholder empty.
- Set the Tooltip to `Select which of our Expectation Suites to use against the provided file.`
- Click Add Variable.
Give your Blueprint the following Description:
Provide a link to a publicly available file in the File URL field. This file will be run against the Expectation Suite selected, with the final validation file sent directly to the S3 Bucket listed under "Bucket Name", nested under a folder called great-expectations/{expectation-suite}/
At this point, your screen should look something like this.
Click Preview this Blueprint to verify how everything will look and feel to a user.
Once you've verified that everything is set up correctly, go ahead and click Next Step.
Step 3 - Provide Your Code
- Click the upload section of the page and select the `great_expectations_demo.zip` file from your computer.
- On the right-hand side of the screen, enter `run_great_expectations.py` into the File to Run field.
- Click the + icon next to Arguments 3 times.
We'll be creating an argument for each of the Blueprint Variables that we created in the last step, passing through the user input as `${reference_name}`.
- In the first set of fields, type `--input_url` for the flag and `${input_url}` for the value.
- In the second set of fields, type `--output_bucket_name` for the flag and `${output_bucket_name}` for the value.
- In the final set of fields, type `--expectation_suite` for the flag and `${expectation_suite}` for the value.
Once these steps are complete, your screen should look exactly like this.
Once you've verified that everything has been set up correctly, click Next Step in the bottom right.
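To see how the three arguments reach the script, here is a minimal sketch of parsing them with `argparse`. It is an illustration under assumptions, not the code shipped in the .zip: the `parse_args` helper, the defaults, and the help text are hypothetical, while the flag names and suite values match the ones entered above.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that Shipyard passes through from the Blueprint Variables."""
    parser = argparse.ArgumentParser(
        description="Run a file against an existing Expectation Suite."
    )
    parser.add_argument(
        "--input_url", required=True,
        help="Publicly accessible URL to download the file from.",
    )
    parser.add_argument(
        "--output_bucket_name", default="",
        help="Bucket to store the validation JSON files.",
    )
    parser.add_argument(
        "--expectation_suite", default="amazon-product-reviews",
        choices=["amazon-product-reviews", "sample-suite"],
        help="Which Expectation Suite to run against the file.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.input_url, args.output_bucket_name, args.expectation_suite)
```

Restricting `--expectation_suite` with `choices` mirrors the Select variable: a value outside the two Internal Values is rejected before any data is downloaded.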
Step 4 - Requirements
Environment Variables
- Click the + icon next to Environment Variables twice to add two new variables.
- Set the first variable's Name to `GREAT_EXPECTATIONS_AWS_ACCESS_KEY_ID` and Value to the Access Key ID of the bucket you chose during your Setup.
- Set the second variable's Name to `GREAT_EXPECTATIONS_AWS_SECRET_ACCESS_KEY` and Value to the AWS Secret of the bucket you chose during your Setup.
The value field will always show ••••••• as you type. This is because Environment Variables are commonly used for passwords and secrets. You can always reveal what you've written by clicking the eye icon.
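Inside the script, these two values are available as ordinary environment variables. The sketch below shows one way they might be read; the `s3_credentials` helper is hypothetical, and only the variable names come from this tutorial.

```python
import os


def s3_credentials(env=os.environ) -> dict:
    """Collect the AWS credentials set as Environment Variables on the Blueprint."""
    return {
        "aws_access_key_id": env["GREAT_EXPECTATIONS_AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["GREAT_EXPECTATIONS_AWS_SECRET_ACCESS_KEY"],
    }


# With boto3 installed, the result could be passed straight to the S3 client:
#   import boto3
#   s3 = boto3.client("s3", **s3_credentials())
```

Indexing with `env[...]` raises a `KeyError` when a variable is missing, which surfaces a misconfigured Blueprint early instead of failing later during the upload.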
Packages
- Click the + icon next to Packages 4 times to add four new packages.
- Set the first Package Name to `boto3` and the version to `==1.12.16`.
- Set the second Package Name to `great-expectations` and the version to `==0.9.5`.
- Set the third Package Name to `pandas` and the version to `==1.0.1`.
- Set the fourth Package Name to `wget` and the version to `==3.2`.
Your screen should look similar to this:
Once you're done, go ahead and click the Next Step button at the bottom of the screen.
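If you want to run the script on your own machine first, a quick check like the one below can confirm that your local environment matches the four pinned versions. This is an optional sketch: the `version_mismatches` helper is hypothetical and relies only on the standard library's `importlib.metadata` (Python 3.8 or later).

```python
from importlib import metadata

# Versions pinned in the Packages section of the Blueprint.
PINNED = {
    "boto3": "1.12.16",
    "great-expectations": "0.9.5",
    "pandas": "1.0.1",
    "wget": "3.2",
}


def version_mismatches(pinned=PINNED) -> dict:
    """Map each package whose installed version differs from its pin to that version.

    Packages that are not installed at all map to None.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in version_mismatches().items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```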
Step 5 - Settings
- Under the State section, select Everyone.
- Under the Information section:
  - Give your Blueprint the name of `Great Expectations - Demo`.
  - Give your Blueprint the Synopsis of `Run a file against an existing Expectation Suite.`
- Leave the Guardrails section defaults of None and ASAP.
Your Blueprint should look like this:
- Click the Save & Finish button at the bottom of the screen.
You've successfully set up Great Expectations as a Blueprint.
Now anyone in your organization can use the Blueprint to test data against your Expectation Suites. We're going to test our Blueprint to validate that everything runs correctly.
Step 6 - Setting Up a Fleet
- Click Use this Blueprint on the success screen.
You should now be in the Fleet Builder and your screen should look like this:
- Enter `https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv` into the File URL field.
- Leave the Bucket Name as is.
- Leave the Expectation Suite as is.
- Click the gear icon on the sidebar of the Fleet Builder to open Fleet Settings.
- In this section, name your Fleet `GE - Sample Data - Amazon Reviews`.
- Click Save & Finish Fleet.
- Immediately click Run your Fleet.
Step 7 - Review the Results
You should be immediately redirected to the actively running Fleet Log. Within the Log you'll be able to see all of the expectations and their output for the sample data.
You should also be able to see the validation file in your S3 bucket of choice.
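The validation file is stored under the `great-expectations/{expectation-suite}/` folder described in the Blueprint. As a rough sketch of how such an object key might be composed: the `validation_key` helper is hypothetical, and `SHIPYARD_LOG_ID` is an assumed name for a Platform Environment Variable, so check the Shipyard documentation for the variables actually available.

```python
import os


def validation_key(expectation_suite: str, env=os.environ) -> str:
    """Build the S3 object key for a validation result file."""
    # SHIPYARD_LOG_ID is an assumed variable name; "local" is used outside Shipyard.
    run_id = env.get("SHIPYARD_LOG_ID", "local")
    return f"great-expectations/{expectation_suite}/{run_id}.json"


# With boto3 installed, the upload itself could then look like:
#   s3.put_object(Bucket=bucket_name, Key=validation_key(suite), Body=json_text)
```

Including a per-run identifier in the key keeps each run's output separate, so simultaneous runs against the same suite do not overwrite one another.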
Congratulations on setting up a Great Expectations Blueprint! You now have a repeatable solution that can be used again and again for all of your Expectation Suites.
What Comes Next
Now that you've successfully worked your way through this tutorial, there are many additional things that you can try out on your own with this knowledge.
Test Additional Variables
Set up additional Vessels using the Great Expectations - Demo Blueprint and change just a few of the variables.
- Try using different Amazon Review Files found here. Some of them will cause failures because they don't meet all of the expectations within the Expectation Suite.
- Try leaving the Bucket Name blank.
- Try sending your data to a different bucket.
- Use the Sample expectation suite.
Tip: You can easily make multiple Vessels with slightly different Inputs by duplicating this tutorial Vessel.
Create New Variables
Our tutorial may not have had enough flexibility to meet the general data demands of your organization. You can easily tweak the script to accomplish some of the following goals:
- Set a custom file name for the validation output.
- Pull files from other non-public sources.
- Allow options for different exit code conditions.
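As an example of the last idea, the exit code could depend on a pass-rate threshold rather than requiring every expectation to succeed. The sketch below is one possible approach, not part of the tutorial script: the `exit_code_with_threshold` helper and its `min_success_percent` parameter are hypothetical, and it assumes the validation result carries a `statistics` dictionary with a `success_percent` value alongside the top-level `success` flag.

```python
def exit_code_with_threshold(validation_result: dict,
                             min_success_percent: float = 100.0) -> int:
    """Return 0 if enough expectations passed, 1 otherwise."""
    stats = validation_result.get("statistics", {})
    percent = stats.get("success_percent")

    # Fall back to the overall flag when no percentage is reported.
    if percent is None:
        return 0 if validation_result.get("success") else 1

    return 0 if percent >= min_success_percent else 1
```

The threshold could be exposed to users as another Blueprint Variable and passed to the script as an extra argument, in the same way as the three variables created in Step 2.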
Expectation Suite Updates
- Add your own expectation suite into the `great_expectations/expectations` folder, add the suite as a new Select Option in the Blueprint, and set up a new Vessel to use that expectation suite.
- Update the existing `amazon-product-reviews` suite to include additional rules based on your own findings of the Amazon review data.
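Because the suites are plain JSON files, a new rule can be appended with a few lines of Python. The sketch below assumes the suite file holds an `expectations` list whose entries have `expectation_type` and `kwargs` keys; the `add_expectation` helper is hypothetical, so compare against the files in the .zip before relying on it.

```python
import json
from pathlib import Path


def add_expectation(suite_path, expectation_type, **kwargs) -> dict:
    """Append one expectation to a suite JSON file and return the updated suite."""
    path = Path(suite_path)
    suite = json.loads(path.read_text(encoding="utf-8"))
    suite.setdefault("expectations", []).append(
        {"expectation_type": expectation_type, "kwargs": kwargs}
    )
    path.write_text(json.dumps(suite, indent=2), encoding="utf-8")
    return suite
```

For instance, `add_expectation("great_expectations/expectations/amazon-product-reviews.json", "expect_column_values_to_be_between", column="star_rating", min_value=1, max_value=5)` would add a range check, assuming the suite file has that name and the data has a `star_rating` column. Re-upload the updated .zip to the Blueprint for the change to take effect.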