Modern Data Stack Conference 2020 - Top 5 Takeaways

The Modern Data Stack Conference (MDSCON2020) wrapped up last week after two days full of presentations covering everything from Data Team structures to Data Infrastructure setups. It was our first time attending a fully virtual conference, and it was packed with useful information!

Across all of the sessions, these are the top takeaways we left with.

1. ELT is the Foundation of the Modern Data Stack

The general consensus was that with modern cloud databases like Snowflake, BigQuery, and Redshift, it's much easier to subscribe to ELT, where you start by dumping all of your raw data into a database. Once it lives in the database, you can use SQL to manipulate and clean the data as needed, creating tables on the fly (sketched just after the list below). This structure gives quite a few benefits:

  • Data Engineers have less responsibility to transform the data with the right business logic beforehand.
  • There's less code involved in getting data into a usable state, so Analysts using SQL can get their hands on it sooner.
  • If something ends up wrong with your data sets, fixing it is as easy as re-running a SQL query.
  • Data teams have "infinite time" freed up because questions can be answered without needing to involve them.
  • Data usage can be monitored to better understand organizational needs and priorities.
Source: Fraser Harris - MDSCON2020
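
To make the "T" concrete, here's a minimal sketch in Snowflake-flavored SQL. It assumes a loader like Fivetran has already landed a raw Stripe export untouched in a raw schema; all table and column names are invented for the example.

    -- The "EL" is done: raw.stripe_charges holds the unmodified payload.
    -- The "T" is just SQL, run (and re-run) inside the warehouse itself.
    create or replace table analytics.charges as
    select
        id                    as charge_id,
        customer              as customer_id,
        amount / 100.0        as amount_usd,
        to_timestamp(created) as charged_at
    from raw.stripe_charges
    where status = 'succeeded';

Because the raw data never goes away, a bad transformation is fixed by editing the query and running it again; nothing upstream has to be re-extracted.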

We're big believers that ELT is the future, so we were excited to see so many people sharing the same view!

2. dbt is a Force to be Reckoned With

In almost every presentation we saw, dbt (data build tool) was mentioned as the best way to build out your organization's data tables, views, and more. With ELT taking over, more teams are dumping all of their raw data directly into the database and using tools like dbt to manage the creation, testing, and version control of the tables being generated. This setup also empowers analysts with strong SQL knowledge to own the data modeling, without having to rely as heavily on Data Engineers.

Source: Tristan Handy - MDSCON2020
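
For a flavor of what this looks like in practice, here's a hypothetical dbt model. A model is just a SQL select statement in a version-controlled file; dbt materializes it as a table or view, and a companion schema.yml file can declare the sources it reads from along with tests (such as not_null or unique) on its columns. The names below are invented for the example.

    -- models/staging/stg_orders.sql (hypothetical)
    -- dbt resolves the source() call and handles the DDL for you.
    select
        order_id,
        customer_id,
        cast(ordered_at as date) as order_date,
        amount_cents / 100.0     as amount_usd
    from {{ source('shop', 'raw_orders') }}
    where order_id is not null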

While it's been great to see the growth of dbt over the past few years, there's still an information gap to close in showing data teams how and why dbt should be used. We believe we'll see a lot more guides and integrations with dbt crop up over the next year.

3. Self-Service Data complicates Data Governance

As teams transition to the ELT and self-service models, it's easy to end up with a ton of data but little control over how people access and use it. Data Governance is a topic that's becoming more complex as organizations expand the number of data sources they have on hand.

One team decided that they weren't going to bother managing data access directly in their data warehouse. Instead, they use what they call "Transitive Permissions". If a user has access to Service A, they can load data for Service A using their own credentials and query it as they see fit. Each user becomes responsible for their own data. While this could mean that data gets duplicated in the warehouse, it also means that no team is ever waiting on a centralized team to make their data available.
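
In warehouse terms, one way this could look (a sketch of the idea, not their exact setup) is a private schema per user, so data loaded with someone's own service credentials is only queryable by that person. The names below are hypothetical and the syntax is Snowflake-flavored.

    -- Each user gets a personal sandbox; only their role can touch it.
    create schema if not exists analytics.jane_sandbox;
    grant usage on schema analytics.jane_sandbox to role jane_role;
    grant create table on schema analytics.jane_sandbox to role jane_role;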

Another team had an interesting approach where they loaded as much raw data from external services as they possibly could into their Data Warehouse and released it to everyone. Instead of building out the right data models initially, they put systems in place to track and monitor data usage by the analytics team. As patterns in data usage emerged, they used this information to prioritize how organizational "Source of Truth" data should be developed.
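
Getting started with that kind of monitoring doesn't require special tooling. As a rough sketch, Snowflake exposes a query log through its ACCOUNT_USAGE.QUERY_HISTORY view (other warehouses have similar logs) that can be aggregated to see who is leaning on a given raw table:

    -- Crude usage tracking: who ran queries touching "orders"
    -- in the last 30 days? (Text matching is approximate by nature.)
    select
        user_name,
        count(*) as queries_last_30d
    from snowflake.account_usage.query_history
    where start_time >= dateadd(day, -30, current_timestamp())
      and query_text ilike '%orders%'
    group by user_name
    order by queries_last_30d desc;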

We're interested to see how these different strategies pan out for organizations. If you have an alternative strategy you use, we'd love to hear about it!

4. Focus on getting your Data Fundamentals Right

Making a predictive Machine Learning model. Analyzing sentiment with NLP. Incorporating AI decision-making. Everyone wants to do advanced, glamorous things with their data. However, you can really start shooting yourself in the foot if you focus too much on these fancier activities without getting the basics right.

Michael Kaminsky of Locally Optimistic and Jacob Bedard of Dialpad both referenced the "Data Pyramid of Priorities", an adaptation of Maslow's hierarchy of needs. While their versions differed slightly, the core concept was the same: you have to get the data fundamentals right before moving to the next step with your data.

Source: Michael Kaminsky - MDSCON2020
Source: Jacob Bedard - MDSCON2020

Unsurprisingly, the bottom tiers of the pyramid focus on "Data Storage" and "Data Trust". Too many organizations try to jump right in with their data, attempting to use it immediately to prove its value. However, if you don't have data stored consistently and you aren't proactively building alerts for data issues, something will inevitably go wrong. These issues build up over time, eroding the trust in your team and your data that you worked so hard to build.

Shore up your data foundation: run through scenarios of everything that could go wrong with your data and put systems in place to ensure those situations don't happen. Use tools like Great Expectations to monitor data quality throughout your pipelines. When issues occur, immediately let the relevant parties know that you're aware of the problem, what steps you're taking to fix it, and how long they can expect it to last. Only through these steps can you strengthen your organization's data and start working towards advanced usage.
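
Great Expectations itself is a Python framework, but the underlying pattern translates to any stack. As a minimal, hand-rolled sketch of the same idea in SQL (table and column names hypothetical), a scheduled query can return rows only when an expectation is violated, so any non-empty result becomes an alert:

    -- Expectation: charge amounts are never negative.
    -- An empty result means the check passed.
    select
        'analytics.charges.amount_usd >= 0' as failed_expectation,
        count(*)                            as offending_rows
    from analytics.charges
    where amount_usd < 0
    having count(*) > 0;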

5. "Last Mile" Data Actioning is still Underserved

There are plenty of tools on the market to facilitate loading, transforming, and visualizing data. But once the data is loaded and cleaned, and you're past the first few steps of the Pyramid of Priorities, how do you start using it to its full potential? These "Last Mile" solutions cover a wide array of situations and are incredibly important for driving business value. The questions they need to answer include:

  • How do you enhance your CRM information?
  • How do you make your data fuel your marketing decisions?
  • How do you automate internal processes?
  • How do you deploy your ML model to production?
  • How do you distribute reports to your clients?
  • How do you monitor and alert on data quality issues?

We're proud that Shipyard is actively addressing this need, making it easier than ever for teams to deploy solutions that act on their data in a matter of minutes. If you're struggling to put your data to work quickly, try us out!

---

Thanks again to the team at Fivetran for putting on this great virtual conference, and to all of the speakers who gave their time to share their learnings. Looking forward to seeing everyone at the next one!


About Shipyard:
Shipyard is a modern data orchestration platform for data engineers to easily connect tools, automate workflows, and build a solid data infrastructure from day one.

Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build data workflows while enabling data engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.

The Shipyard team has built data products for some of the largest brands in business and deeply understands the problems that come with scale. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams.

With a high level of concurrency and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them.

For more information, visit www.shipyardapp.com or get started for free.