Continuous Integration and Deployment in Real-Time Data Processing with Jenkins

Aekanun Thongtae
Oct 13, 2023

Introduction

In the rapidly evolving landscape of Big Data, real-time processing goes beyond managing vast data streams; it's about reacting to them promptly. Building on insights from a prior article, "Unlocking Real-Time Insights: Building a Stream Processing Solution with Apache Beam, Google Cloud Dataflow, and Terraform," this article focuses on keeping those operations streamlined and efficient. Enter Jenkins, a mainstay of the CI/CD domain, well suited to automating the testing and deployment of data pipelines.

Both CI and CD hold paramount importance in data processing. CI integrates new data transformations and detects potential snags early, safeguarding pipeline quality, while CD quickly puts those thoroughly vetted changes into production. Together they keep state-of-the-art data methodologies and algorithms continually refined and in operation, sharpening our insights.

Key Players in this Domain:

  • Apache Beam: Renowned for batch and stream processing unification.
  • Google Cloud Dataflow: The go-to for scalable data processing.
  • Terraform: A powerhouse for infrastructure orchestration.

While these tools are influential in their own right, Jenkins is the piece that ties them together, automating the testing and deployment that keep a data processing pipeline dependable.

1. Jenkins in the Big Data Landscape

Jenkins does more than just automate; it’s pivotal in meticulously testing and deploying real-time data processing, ensuring harmony with tools like Apache Beam, Google Cloud Dataflow, and Terraform.
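
For instance, a Jenkins deployment stage can submit an Apache Beam job to Google Cloud Dataflow. The following is a minimal sketch only; the script name (main.py), project, region, and bucket are hypothetical placeholders rather than values from the earlier article:

stage('Deploy to Dataflow') {
    steps {
        // Submit a hypothetical Beam Python pipeline (main.py) to Dataflow;
        // the project, region, and bucket below are placeholder values.
        sh '''
            python main.py \
                --runner=DataflowRunner \
                --project=my-gcp-project \
                --region=us-central1 \
                --temp_location=gs://my-bucket/tmp
        '''
    }
}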

1.1 Why Jenkins?

  • Automated Testing: Jenkins runs the test suite on every code alteration, safeguarding the integrity of Big Data pipelines.
  • Continuous Deployment: Ensures that Big Data applications, packaged as Docker images, reach production consistently.
  • Feedback Loop: Jenkins's real-time alerts are indispensable for keeping pipelines resilient.

2. Setting up Jenkins for Data Pipelines

[Figure: DevOps development & deployment pipeline and the stream data pipeline]

2.1 The Jenkins Workflow

  • Clean Workspace: Every build starts anew, clearing out artifacts from previous runs to make room for fresh integrations.
  • Checkout: Jenkins fetches the most up-to-date code, so every build runs against the latest changes.
  • Run Tests: Beyond simply checking the code, Jenkins verifies that new data methods and algorithms perform as anticipated.
  • Deployment Prep: Once tests pass, Jenkins gears the code up for launch.
  • Build Docker Image: The culmination: your data application is packaged into a Docker image.

2.2 Before You Begin with Jenkins

  • A running, properly configured Jenkins server.
  • Essentials: GitHub, Docker, and Pipeline plugins.
  • Credentials for Docker registry sign-in.

3. Jenkins and GitHub: A Seamless Communication

In the realm of Continuous Integration (CI) and Continuous Deployment (CD), Jenkins acts as the heart of automation. Its interplay with repositories like GitHub ensures that changes are automatically tested and subsequently deployed, facilitating an efficient development lifecycle.

3.1 Setting Up GitHub Webhooks

The primary mechanism through which Jenkins detects code changes in a GitHub repository is via webhooks.

  • Webhook Configuration: A webhook in the GitHub repository settings notifies Jenkins of code alterations instantaneously.
  • Payload URL: It’s the Jenkins server’s address to which GitHub dispatches updates about repository occurrences. Generally, it resembles: http://your-jenkins-server/github-webhook/.
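
On the Jenkins side, a pipeline job opts in to these webhook notifications through a trigger. Here is a minimal sketch, assuming the GitHub plugin (covered in section 3.3) is installed:

pipeline {
    agent any
    // Start a build whenever GitHub's webhook reports a push to this repository
    triggers {
        githubPush()
    }
    stages {
        stage('Checkout') {
            steps { checkout scm }
        }
    }
}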

3.2 Authenticating Jenkins with GitHub

To fetch the code and gather repository details, Jenkins requires permissions.

  • Generating an Access Token: This token, created in GitHub under user settings, furnishes Jenkins with repository access rights.
  • Storing Token in Jenkins: By storing this token in Jenkins (either as ‘Secret Text’ or using ‘Username with password’ — where the username is your GitHub username and the password is the token), Jenkins can engage with GitHub securely.
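
Once stored, the credential is referenced by its ID wherever Jenkins needs to reach GitHub. A minimal checkout sketch, where the credentials ID (github_creds) and repository URL are illustrative placeholders:

stage('Checkout') {
    steps {
        // 'github_creds' is a hypothetical credentials ID holding the
        // GitHub username and access token; the URL is illustrative.
        git url: 'https://github.com/your-org/your-repo.git',
            branch: 'main',
            credentialsId: 'github_creds'
    }
}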

3.3 Jenkins GitHub Plugin

To enhance communication, Jenkins employs the GitHub plugin.

  • Installation: It can be incorporated via the Jenkins plugin manager.
  • Configuration: After installation, configure it with the GitHub server details and the credentials created above.

3.4 Pulling Code and Triggering Builds

With webhooks and credentials established:

  • Code Changes: When a developer pushes code to the GitHub repository, the webhook instantly notifies Jenkins.
  • Triggering Builds: Depending on the Jenkinsfile within the repository or Jenkins’s pipeline configuration, the build/test/deploy mechanism gets underway.

3.5 Reporting Status to GitHub

Jenkins can also report build statuses back to GitHub.

  • Commit Status Publisher Plugin: Through plugins such as this, Jenkins communicates build statuses back to GitHub.
  • Feedback Loop: It guarantees developers receive instantaneous feedback on their submissions, promoting swift error resolutions and quality code integrations.
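
As one illustration, the GitHub plugin itself exposes a commit-status step that a pipeline can call from its post section. This is a minimal sketch of one way to wire it up, not a prescription:

post {
    success {
        // Report a passing status for the built commit back to GitHub
        step([$class: 'GitHubCommitStatusSetter',
              statusResultSource: [$class: 'ConditionalStatusResultSource',
                  results: [[$class: 'AnyBuildResult',
                             state: 'SUCCESS',
                             message: 'Build passed']]]])
    }
}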

4. Implementing Jenkins in Big Data Processing

To provide clarity, here’s an illustrative Jenkins pipeline script (Jenkinsfile) with inline comments:

pipeline {
    agent any

    stages {
        stage('Clean Workspace') { // Clearing workspace for fresh operations
            steps { cleanWs() }
        }
        stage('Checkout') { // Fetching the most recent code
            steps { checkout scm }
        }
        stage('Run Tests') { // Executing tests to validate code
            steps {
                sh 'pwd'          // Print working directory
                sh 'ls -la test/' // List test directory contents
                sh 'pytest test/' // Run tests using pytest
            }
        }
        stage('Prepare for Deployment') { // Getting ready for deployment post successful tests
            steps { sh 'echo "Tests successful!"' }
        }
        stage('Build Docker Image') { // Creating a Docker image
            steps {
                withCredentials([usernamePassword(credentialsId: 'dockerhub_creds', usernameVariable: 'USER', passwordVariable: 'PASS')]) {
                    sh 'docker login -u $USER -p $PASS' // Docker login
                }
                sh 'docker build -t mydataapp/cicd_pipeline:latest .' // Building Docker image
                sh 'docker push mydataapp/cicd_pipeline:latest'       // Pushing image to repository
            }
        }
    }
    post {
        success { echo 'All tasks accomplished without a hitch!' } // Success message
        failure { echo 'Encountered a hiccup.' } // Failure message
    }
}

To get Jenkins up and running:

  • Kickstart a New Job: Opt for ‘New Item’ in Jenkins, label your pipeline, choose ‘Pipeline’, and click ‘OK’.
  • Configuring the Pipeline: In the pipeline settings, select ‘Pipeline script from SCM’, choose your version control system (like Git), and input your Jenkinsfile’s repository URL.
  • Initiation: Click ‘Build Now’. Jenkins swings into action, performing each stage in sequence, keeping you in the loop about any glitches.

The Jenkins CI/CD pipeline view provides a clear picture of the development process. When a stage such as "Build Docker Image" fails, that alert stands out against the green of the other stages, which signify smooth operations, enabling quick error detection before any code reaches users.

5. Conclusion

Incorporating Jenkins into Big Data workflows is a game-changer for efficiency. By pairing Jenkins with heavy-hitters like Apache Beam, Google Cloud Dataflow, and Terraform, you set the stage for swift and dependable data outcomes. How has Jenkins transformed your data processes? Share your journey with us!

Aekanun Thongtae

Experienced Senior Big Data & Data Science Consultant with a history of working in many enterprises and various domains. Skilled in Apache Spark and Hadoop.