Cloud & Engineering

Azure Data Factory - CICD with YAML Pipelines

Posted by Shane Wei on 17 October 2023

CICD, Azure Pipelines, Azure Data Factory

Introduction 

Azure Data Factory (ADF) is Azure's cloud ETL service for scale-out, serverless data integration and transformation. It offers a code-free UI with many maintenance-free connectors, so it can be maintained by non-coders. Regardless of how much or how little code an ADF solution needs, it is still important to deploy it across multiple environments so the production ADF stays in a secure, consumption-ready state, which makes Continuous Integration/Continuous Delivery (CI/CD) essential. Something I have observed is that while there are documentation and guides on how to implement ADF CI/CD, most of them, including the Microsoft documentation here, only cover the process using the classic editor rather than YAML pipelines. 

In this post, I will explain and demonstrate the process through YAML, which is the newer and recommended way of working. YAML pipelines allow for easier collaboration in teams (code reviews, pull requests, and in-code comments) and make it easier to compare versions and revert to previous ones. This post will not cover Azure platform resource deployments, only the deployment of ADF resources such as data pipelines, datasets, and data flows. 

 

Implementation Overview 

The process of ADF CI/CD consists of two separate pipelines. The first pipeline builds and publishes the code of the Git-integrated ADF, which is usually the development instance. The second pipeline is a release pipeline that follows on from the first, deploying the Azure Resource Manager (ARM) templates and parameter files to higher environments. The entire flow, from the creation of resources to deployment to higher environments, can be seen in the diagram below. 

 

Pre-requisites 

  • All Azure resources required are already deployed. 
  • The development instance of the ADF should be Git integrated. 
  • Git integration should be done with Azure DevOps. 
  • Pipelines used for CI/CD should be hosted on Azure DevOps. 
  • A service principal will need to be created for the subscription you are deploying to. 
  • Self-hosted integration runtimes (IR) should have the same name across all environments.  

 

Git Integration 

Git is the industry standard for code versioning, and using it with the ADF is highly recommended for the following reasons: 

  • It is essential for creating an automated CI/CD process. 
  • Reverting to previous versions will be easier. 
  • Removing large amounts of content or renaming values is easier when directly editing the JSON files. 

Git integration stores the ADF resources in the linked repository of choice as JSON files, along with the generated parameter files and ARM templates. Do note that ADF only supports Git integration with Azure DevOps and GitHub repositories. 

By default, an adf_publish branch is created, and any changes and updates published from the ADF appear in that branch. The publish branch can also be configured to a different name, as sketched below. 
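As a minimal sketch: adding a publish_config.json file to the root folder of the collaboration branch overrides the default publish branch (factory/adf_publish is an illustrative name, not a requirement): 

{ 
    "publishBranch": "factory/adf_publish" 
} 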

The YAML pipelines that deploy the ADF resources should be situated in the main branch, as adf_publish should only capture the generated ARM templates and parameter files, illustrated below. 
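Based on the paths used later in this post, the adf_publish branch ends up with a layout along these lines (the scripts folder and the environment-specific parameter file are added manually; the rest is generated by ADF): 

adf_publish 
└── adf-dev-sample 
    ├── ARMTemplateForFactory.json 
    ├── ARMTemplateParametersForFactory.json 
    ├── ARMTemplateParametersForFactoryTest.json 
    └── scripts 
        └── PrePostDeploymentScript.ps1 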

 

Build and Publish Pipeline 

The build and publish pipeline builds and publishes all the contents of the Git-integrated ADF. It checks out the adf_publish branch and publishes its contents as build artifacts, allowing them to be consumed in the release pipeline. A sample snippet of the code can be seen below. 

 

trigger: none 

resources: 
  repositories: 
    - repository: adf-publish 
      type: git 
      ref: adf_publish 
      name: '<Project Name>/<Repository Name>' 
 
jobs: 
- job: PublishDataFactory 
  displayName: Publish Data Factory 
  pool: 
    vmImage: windows-2019 
  steps: 
  # Check out the adf_publish branch into $(Agent.BuildDirectory)/publish 
  - checkout: adf-publish 
    path: publish 
    fetchDepth: 1 
  # Publish the factory folder as the 'drop' build artifact 
  - task: PublishBuildArtifacts@1 
    displayName: 'Publish Artifact: drop' 
    inputs: 
      PathtoPublish: '$(Agent.BuildDirectory)/publish/adf-dev-sample' 
      ArtifactName: drop 
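As written, this pipeline only runs when triggered manually. If you would rather have it run automatically whenever a publish from the ADF UI updates adf_publish, one option (a sketch, assuming the repository is hosted in Azure Repos) is to add a trigger to the repository resource: 

resources: 
  repositories: 
    - repository: adf-publish 
      type: git 
      ref: adf_publish 
      name: '<Project Name>/<Repository Name>' 
      # Run this pipeline whenever the adf_publish branch is updated 
      trigger: 
        branches: 
          include: 
            - adf_publish 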

 

Release Pipeline 

The release pipeline deploys the ADF to environments such as test and production through a series of stages. As no manual changes are needed after the build and publish pipeline has run, the release pipeline can be set to trigger automatically upon its completion. The sample code snippet below shows how each stage calls the template YAML file to deploy the resources, with parameters specific to each environment. 

 

name: data-factory-release-pipeline-$(Build.BuildId) 

trigger: none 

variables: 
- name: serviceConnection 
  value: '<Your service connection>' 
- name: subscriptionId 
  value: <Your subscription id> 
- name: resourceGroup 
  value: 'rg-tst-sample' 
- name: dataFactory 
  value: 'adf-tst-sample' 

pool: 
  vmImage: ubuntu-latest 

resources: 
  pipelines: 
  - pipeline: adf-build-and-publish 
    source: adf-build-and-publish 
    project: <Project Name> 
    trigger: true 

  
  repositories: 
    - repository: adf-publish 
      type: git 
      ref: adf_publish 
      name: '<Project Name>/<Repository Name>' 

stages: 
- stage: DeployToTest 
  displayName: DeployToTest 
  jobs: 
  - template: templates/data-factory.release.template.yaml 
    parameters: 
      azureSubscription: ${{ variables.serviceConnection }} 
      subscriptionId: ${{ variables.subscriptionId }} 
      environment: 'test-environment' 
      resourceGroupName: ${{ variables.resourceGroup }} 
      dataFactoryName: ${{ variables.dataFactory }} 

 

Template File 

The template YAML file contains the three main jobs that deploy the actual ADF: the pre-deployment step, the deployment step, and the post-deployment step. In this section, I will go into detail on each of them. The code snippet for the template file can be seen below. 

parameters: 
  - name: azureSubscription 
    type: string 
  - name: subscriptionId 
    type: string 
  - name: environment 
    type: string 
  - name: resourceGroupName 
    type: string 
  - name: dataFactoryName 
    type: string 

jobs: 
  - job: PreDeploymentScript 
    displayName: 'Disable Triggers' 
    pool: 
      vmImage: 'ubuntu-latest' 
    steps: 
    - checkout: adf-publish 
      path: publish 
    - task: AzurePowerShell@5 
      displayName: 'Azure PowerShell script: Pre-deployment' 
      inputs: 
        azureSubscription: ${{ parameters.azureSubscription }} 
        ScriptPath: '$(Agent.BuildDirectory)/publish/adf-dev-sample/scripts/PrePostDeploymentScript.ps1' 
        ScriptArguments: '-armTemplate ''$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateForFactory.json'' -ResourceGroupName ${{ parameters.resourceGroupName }} -DataFactoryName ${{ parameters.dataFactoryName }} -predeployment $true -deleteDeployment $false' 
        azurePowerShellVersion: LatestVersion 

  - job: Deployment 
    displayName: 'Deployment' 
    pool: 
      vmImage: 'ubuntu-latest' 
    dependsOn: PreDeploymentScript 
    steps: 
    - checkout: adf-publish 
      path: publish 
    - task: AzureResourceManagerTemplateDeployment@3 
      displayName: 'ARM Template deployment: Resource Group scope' 
      inputs: 
        azureResourceManagerConnection: ${{ parameters.azureSubscription }} 
        subscriptionId: ${{ parameters.subscriptionId }} 
        resourceGroupName: '${{ parameters.resourceGroupName }}' 
        location: 'Australia East' 
        csmFile: '$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateForFactory.json' 
        csmParametersFile: '$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateParametersForFactoryTest.json' 
        overrideParameters: '-factoryName ${{ parameters.dataFactoryName }}' 

  - job: PostDeploymentScript 
    displayName: 'Enable Triggers' 
    dependsOn: Deployment 
    pool: 
      vmImage: 'ubuntu-latest' 
    steps: 
    - checkout: adf-publish 
      path: publish 
    - task: AzurePowerShell@5 
      displayName: 'Azure PowerShell script: Post-deployment' 
      inputs: 
        azureSubscription: ${{ parameters.azureSubscription }} 
        ScriptPath: '$(Agent.BuildDirectory)/publish/adf-dev-sample/scripts/PrePostDeploymentScript.ps1' 
        ScriptArguments: '-armTemplate ''$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateForFactory.json'' -ResourceGroupName ${{ parameters.resourceGroupName }} -DataFactoryName ${{ parameters.dataFactoryName }} -predeployment $false -deleteDeployment $false' 
        azurePowerShellVersion: LatestVersion 

 

  • Pre-Deployment Step 

The pre-deployment step ensures that the ADF being deployed to is in a ready state, as certain conditions must be met for a successful deployment. The main purpose of this step is to disable the triggers in the target environment, as the deployment will fail if it attempts to update triggers that are still active. This is achieved using the PowerShell script PrePostDeploymentScript.ps1, provided in the official Microsoft documentation here. The documentation and the sample code snippet above include all the arguments needed for the script; the key thing to note is that the -predeployment parameter is set to $true. 

 

  • Deployment Step 

The deployment step deploys all the resources to the ADF in higher environments. It will rely on the ARM templates and parameter files that have been built and published by the “Build and Publish” pipeline. 

 

  • Post-Deployment Step 

The post-deployment step is very similar to the pre-deployment step; the only difference is that the -predeployment parameter is set to $false. This step ensures all the triggers in the target environment are started up again so that the ADF is ready for consumption. 

 

Parameters 

Parameter files are generated only for the environment in which the ADF is Git integrated. The primary parameter file used is ARMTemplateParametersForFactory.json. Global parameters are not added to this file by default, but this can be enabled by ticking the “Include global parameters in ARM template” box, as shown in the image below. 

This is an action I recommend, as it means there are fewer steps required with parameter deployment and fewer files to maintain. 
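Once included, global parameters surface in the parameter file under ADF's generated naming convention. As a hedged illustration (the environment global parameter is a hypothetical example), the relevant entries look roughly like this: 

{ 
    "parameters": { 
        "factoryName": { 
            "value": "adf-dev-sample" 
        }, 
        "dataFactory_properties_globalParameters_environment_value": { 
            "value": "dev" 
        } 
    } 
} 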

To ensure higher environments are deployed with the relevant values, you can either use the “overrideParameters” argument for each parameter or create multiple parameter files, one per environment. I recommend the latter, as it makes the parameters easier to maintain, and if further environments are added later, a new parameter file can be created and referenced quickly. 
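For comparison, here is a hedged sketch of the overrideParameters alternative inside the deployment task (AzureKeyVault_properties_typeProperties_baseUrl is an illustrative parameter name; real names come from your generated ARMTemplateParametersForFactory.json): 

- task: AzureResourceManagerTemplateDeployment@3 
  displayName: 'ARM Template deployment with inline overrides' 
  inputs: 
    azureResourceManagerConnection: ${{ parameters.azureSubscription }} 
    subscriptionId: ${{ parameters.subscriptionId }} 
    resourceGroupName: '${{ parameters.resourceGroupName }}' 
    location: 'Australia East' 
    csmFile: '$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateForFactory.json' 
    # Reuse the default parameter file and override environment-specific values inline 
    csmParametersFile: '$(Agent.BuildDirectory)/publish/adf-dev-sample/ARMTemplateParametersForFactory.json' 
    overrideParameters: '-factoryName ${{ parameters.dataFactoryName }} -AzureKeyVault_properties_typeProperties_baseUrl "https://kv-tst-sample.vault.azure.net/"' 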

 

Points to Consider 

  • When using a self-hosted integration runtime (self-hosted IR), the name should be the same across all environments, as the generated ARM template stores the name directly inside the template, which makes it very hard to parameterise the self-hosted IR name. 
  • Alerts are not included in the ARM template generated by the ADF, so these need a separate template and another pipeline step to deploy, as sketched below. 
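As a minimal sketch of that extra step (alerts-template.json and its parameter file are hypothetical, hand-authored files, not generated by ADF): 

- task: AzureResourceManagerTemplateDeployment@3 
  displayName: 'Deploy data factory alert rules' 
  inputs: 
    azureResourceManagerConnection: ${{ parameters.azureSubscription }} 
    subscriptionId: ${{ parameters.subscriptionId }} 
    resourceGroupName: '${{ parameters.resourceGroupName }}' 
    location: 'Australia East' 
    # Hand-authored ARM template for alert rules, kept with the pipeline code 
    csmFile: 'templates/alerts-template.json' 
    csmParametersFile: 'templates/alerts-template.parameters.json' 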

 

Wrapping Up 

Hopefully, this post has shown how straightforward ADF CI/CD can be and has encouraged you to implement it through YAML pipelines rather than the classic editor. Although it may seem like there are more steps to consider, the result is much easier to read, easier to reuse, and a lot more satisfying to see working! 

 

If you like what you read, join our team as we seek to solve wicked problems within Complex Programs, Process Engineering, Integration, Cloud Platforms, DevOps & more!

 

Have a look at our open positions at Deloitte. You can search and see which ones we have in Cloud & Engineering.

 

Have more enquiries? Reach out to our Talent Team directly and they will be able to support you best.
