written by
Jonathan Dorsey

DataOps — DevOps For Data… Right?


The great thing about working with different customers is that you’re often given the opportunity to challenge your opinions and preconceived notions. We recently experienced this with one of our healthcare clients.

We were helping our customer transform their on-premises data warehouse into a cloud-native data pipeline and platform. We worked with data engineers building Azure Data Factory pipelines that called into Databricks notebooks to transform and sanitize data. We are at our best building pipelines and automating the software supply chain, embracing the principles and practices of DevOps. Naturally, we applied this same approach to DataOps. We wanted to treat the Data Factory instances and Databricks workspaces as environments and the corresponding pipelines and notebooks as code.

Our initial thought was something along the lines of “this is kind of the same thing as DevOps, only rebranded for data.” Except it wasn’t. The deeper we got into the project, the more we realized that while the basic concept might be the same, DataOps is, at its core, its own beast.

What is DataOps?

The core concepts behind DataOps certainly sound familiar to anyone who has lived in the world of DevOps for the last decade. Break down silos, create cross-functional ownership, people over process: a lot of the language sounds the same. Ultimately, it is all about getting reliable (and actionable) data to those who need it quickly, safely, and at scale, just like software.

In the traditional extract, transform, load (ETL) scenario, you are reacting to historical data, driving in the rearview mirror. You take the data extracted by yesterday's batch run, spend a day or more running it through a series of transformations, and only then can it be loaded. End users of the data are always looking back to make sure that the organization is on the right path, and moving slowly because the road isn’t great. With DataOps, you can transform the ETL process so it picks up data as it becomes available, at scale, and gets it to end users immediately so they can start making decisions proactively. Rather than always looking back, you can spend more time looking forward at what lane you should be in, or what exit you might need to take.

So we just need to automate everything so we deliver quickly. We can treat data pipelines and analytics models like software. A data engineer is just a software developer, but with data, right?

What did we observe while following DevOps practices?

The first thing that we noticed was that the workflow for a data engineer is quite a bit different from the traditional app dev workflow. If you are familiar with software development, you can appreciate the value in working locally. It is much easier to run builds and execute tests locally before pushing code off to a remote environment to see what happens. This simply isn’t possible when dealing with terabytes (or even petabytes) of data. Data engineers can’t (and likely shouldn’t, even if they could) store data locally on their machines. Instead, there is a push to do your work in close proximity to a more controlled data store, and a lot of the tooling is built with this assumption.

In addition, the tooling itself is not quite ready to embrace full automation for deployments and updates. Below we will dive into each of these tools to highlight the areas where our expectations didn’t quite align with reality.

Azure Data Factory

Azure Data Factory is a managed, serverless data integration service with the goal of orchestrating data workloads at scale. Its strength is the ability to orchestrate complex ETL pipelines and operations, with integrations for over 90 data sources, all within a GUI-based designer.

Like many tools in the data space, there is a push to make all changes in the provided web UI. Generally, this poses a problem when working on large teams with many separate, unrelated changes happening simultaneously. Fortunately, Azure Data Factory does have a pretty solid integration with Git, enabling engineers to work in isolation on their own branches until they are ready to submit a pull request and merge their changes back into the trunk of the repository.

However, what happens when you are ready to take your changes in your dev environment and deploy them to test, staging, or production? Currently, the only recommended approach is to leverage Microsoft’s ARM Templates through a semi-manual process. The workflow would look something like this:

  1. New branch is created in Git to track the engineer’s work in isolation from other team members. This impacts everything the engineer can interact with — pipelines, data sources, and even linked services.
  2. Engineer applies changes and runs tests to validate in Azure Data Factory.
  3. Changes are committed to Git and a pull request is opened.
  4. The pull request is reviewed, approved, and merged into the main branch for the repository.
  5. Normally, this is where our deployment automation would take over, package the changes, and deploy them to the appropriate environment. However, with Data Factory, engineers are working on JSON representations of the pipelines, data sets, and configuration, and what we need are ARM Templates that can be deployed to environments. Generating these templates requires an additional step in the Data Factory UI: the engineer (or someone else on the team) opens Azure Data Factory and selects the `publish` button from the main branch.
  6. Publishing the changes tells Data Factory to generate updated ARM Templates based on the latest changes. These templates are then stored in a special `adf_publish` branch in your repository. NOTE: This branch is wholly owned by Data Factory. Any manual changes to this branch can and will be overwritten during each publish action.
  7. Now the deployment automation can take over to deploy the changes to the appropriate environment(s) — with some caveats listed below.

The first caveat is that you generally can’t simply deploy the exact same ARM template to multiple environments. Things like the name of the Data Factory instance, linked services like Storage Accounts and Key Vaults, or potentially on-premises resource locations differ between environments. These differences must be exposed as parameters of the ARM template, and the appropriate values must be supplied as inputs when the template is applied. Data Factory does have a solution to help parameterize the appropriate values, but it relies on a specially named file in the Git repository with a schema that pinpoints the path in the resulting ARM template to be parameterized. This can be overwhelming for a data engineer who is not familiar with deploying ARM Templates in Azure.
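As a rough illustration of the per-environment inputs described above, the sketch below builds an ARM deployment parameters document for each environment. The environment names, the parameter names (`factoryName`, `keyVaultUrl`), and all values are hypothetical; the real parameter names depend on what your generated template exposes.

```python
import json

# Hypothetical per-environment values. Real parameter names depend on the
# generated ARM template and on how it has been parameterized.
ENVIRONMENTS = {
    "test": {
        "factoryName": "adf-example-test",
        "keyVaultUrl": "https://kv-example-test.vault.azure.net/",
    },
    "prod": {
        "factoryName": "adf-example-prod",
        "keyVaultUrl": "https://kv-example-prod.vault.azure.net/",
    },
}

# Schema URL used by ARM deployment parameter files.
PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2019-04-01/"
    "deploymentParameters.json#"
)


def build_parameters(environment: str) -> dict:
    """Return an ARM deployment parameters document for one environment."""
    try:
        values = ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        # ARM expects each parameter wrapped as {"value": ...}.
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


if __name__ == "__main__":
    print(json.dumps(build_parameters("test"), indent=2))
```

The resulting JSON can be written to a file and supplied alongside the generated template when it is deployed, so the template itself stays identical across environments.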

The second caveat is that, even after parameterizing the appropriate values and providing the correct inputs, you still can’t just apply the ARM Template. Other updates are needed as well, such as disabling active triggers in Azure Data Factory, which is not trivial. Microsoft provides a sample PowerShell script that can be run before and after the deployment to ensure the appropriate updates are made.
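One way to picture the before-and-after steps is as a wrapper around the deployment. The sketch below is not Microsoft's script. It assumes hypothetical `stop_trigger` and `start_trigger` callables supplied by the caller (for example, thin wrappers around whichever SDK or CLI your pipeline uses) and only illustrates the ordering.

```python
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, List


@contextmanager
def triggers_paused(
    active_triggers: Iterable[str],
    stop_trigger: Callable[[str], None],   # hypothetical, supplied by caller
    start_trigger: Callable[[str], None],  # hypothetical, supplied by caller
) -> Iterator[None]:
    """Stop the given triggers, run the wrapped block, then restart them."""
    stopped: List[str] = []
    try:
        for name in active_triggers:
            stop_trigger(name)
            stopped.append(name)
        yield
    finally:
        # Restart only the triggers that were actually stopped. This also
        # runs if the deployment raised, which is a design choice here.
        for name in stopped:
            start_trigger(name)


if __name__ == "__main__":
    log = []
    with triggers_paused(
        ["nightly_load"],
        stop_trigger=lambda name: log.append(("stop", name)),
        start_trigger=lambda name: log.append(("start", name)),
    ):
        log.append(("deploy", "arm-template"))
    print(log)
```

Restarting triggers after a failed deployment may or may not be what you want; some teams would rather leave them stopped until someone has looked at the failure.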

Databricks

Databricks is a data engineering and data science platform built around Apache Spark. It can perform ETL operations as well as facilitate data analytics, ML, and exploration workflows. As with Data Factory, it is centered around its own purpose-built, GUI-based editor.

It’s a powerful platform, but we noticed that it struggled with source control management. Unlike Azure Data Factory, it was not possible to create a separate branch that allowed you to work in isolation. While each notebook could be synced individually to a specific file in a Git repository, this introduced two problems. First, it was cumbersome to sync each file individually instead of syncing the entire workspace to an entire repository. Second, even when individual notebooks were synced to Git, individual users were not able to change the branch for that notebook in isolation. Updating the branch for a notebook updated it for everyone who logged in to that workspace. There did appear to be functionality coming to sync workspaces to repositories, but it was still in preview and did not appear to solve the branching issue, as the recommendation still involved cloning to a user folder in order to sync to a new user branch.

In addition to the above problems, there were some defects in the UI that showed a lack of maturity in the Git integration. For example, the UI would assume all branching should be done from `master` if a different branch had not been selected yet, regardless of whether or not `master` existed. Sometimes existing branches would not load correctly in the dropdown of branch options until the modal window was closed and reopened. Hopefully these issues are cleared up with the new approach for Git integrations coming soon.

All of this led to an awkward workflow for engineers, who were forced to clone notebooks, sync those cloned notebooks to their new Git branch, and then clean up those clones when the code was merged back into the main branch. One quick note here: some of this could likely be avoided if engineers were working locally and using something like the VS Code Databricks Extension. By working locally, engineers would be able to branch their code more easily without impacting others while still running tests remotely within the Databricks Workspace. Due to company policy, that wasn’t an option for us, and engineers were required to work within the web UI.

The great news is that once code was merged into the main branch, it was really easy to deploy the code to any environment using Terraform and the Databricks provider. The workflow for working through the Databricks UI would look something like this:

  1. Navigate to the notebook that is to be updated. Clone the notebook to your user space.
  2. Sync your cloned notebook to Git. NOTE: Be sure to sync it to the existing file with the correct path and name in the repository.
  3. Create a new branch where the work will be performed.
  4. Apply your changes to the notebook.
  5. Save your changes (which can push them to Git) and open a pull request.
  6. The pull request is reviewed, approved, and merged into the main branch of the repository.
  7. The cloned notebook can be deleted from the user workspace.

There is one caveat to note on this workflow as well. At the end of the workflow above, you may notice a suspicious file history in Databricks. It shows the previously loaded version of the notebook as a recent revision that should be saved, instead of showing the notebook with the latest version from Git. To view the latest version from Git, it has to be explicitly restored from the revision history. While there may be ways to work around this, we decided to simply not sync our main notebooks to Git at all. Instead, each time a change is merged into the main branch, the notebooks in the shared Databricks workspace are overwritten with the latest version from Git. This makes Git the source of truth and removes the opportunity for users to inadvertently revert changes from Databricks.
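To make the "overwrite from Git" step concrete, here is a minimal sketch. We used Terraform and the Databricks provider for the actual deployment; this Python version only illustrates the idea. It assumes notebooks are stored as `.py` source files in the repository and that they belong under a hypothetical `/Shared/pipelines` workspace folder. It builds request bodies in the shape of the Databricks Workspace import API (`path`, `format`, `language`, `content`, `overwrite`) and leaves sending them to the caller.

```python
import base64
from pathlib import Path, PurePosixPath
from typing import Dict, List


def build_import_requests(repo_root: Path, workspace_root: str) -> List[Dict]:
    """Map every .py notebook under repo_root to a workspace import payload."""
    requests = []
    for source in sorted(repo_root.rglob("*.py")):
        # Workspace notebook paths carry no file extension.
        relative = source.relative_to(repo_root).with_suffix("")
        target = PurePosixPath(workspace_root) / relative.as_posix()
        requests.append({
            "path": str(target),
            "format": "SOURCE",
            "language": "PYTHON",
            # The import API expects base64-encoded file content.
            "content": base64.b64encode(source.read_bytes()).decode("ascii"),
            # Always replace what is in the workspace: Git is the source of truth.
            "overwrite": True,
        })
    return requests


if __name__ == "__main__":
    # Hypothetical repository layout and workspace folder.
    for request in build_import_requests(Path("notebooks"), "/Shared/pipelines"):
        print(request["path"])
```

Because every payload sets `overwrite` to true, any edits made directly in the shared workspace are discarded on the next merge, which is the behavior described above.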

Final thoughts

Where DataOps really shines is in its potential to scale data projects with greater ease. When you have one or two environments with five or six developers you can easily do things the standard way.

But what happens when you scale? How do you handle 30+ data engineers and scientists interacting with 3-5 separate environments? As you scale out, it becomes a significant problem, one that is hard to solve without automation and a change in the way the work is viewed.

We essentially want to take all the hard lessons learned from DevOps and apply them to DataOps to enable teams to scale more quickly and efficiently. How you do development might be a little bit different, but you still have to have all that automation to coordinate deployments and testing across different environments.

DataOps opens up your ability to scale and makes it possible to work faster. However, the newer data platforms and supporting tooling still have some maturing to do before they stop feeling like the bleeding edge. It’s getting there, though, and the lines between DevOps and DataOps will only continue to blur.

Want to know more about DataOps? Let’s talk. We’d love to help you explore what this means for your organization or hear about your experiences in the field.
