Introduction
At CAW, tackling complex challenges is second nature. What often starts as a simple problem becomes significantly harder when implemented at scale with consistent reliability.
This was the case with one of our products, where the challenge was to implement data collection and processing jobs at scale. Each job involves either scraping data from a website, calling an API, downloading a file, or gathering information from RSS feeds or Twitter.
Individually, these tasks are small and straightforward. However, traditional approaches break down when you need to add over 100 new jobs each month and orchestrate more than 1,000 jobs in production reliably.
To address this, we evaluated several solutions based on the following criteria:
– Complex scheduling with retry logic
– Horizontal scalability for new data sources
– Real-time monitoring for failures
– Instrumentation to optimize slow-performing jobs
– Visibility into the overall health of the system
Ultimately, we narrowed our options down to two contenders: AWS Glue and Apache Airflow. Let’s go over how both data engineering tools measure up against each other:
| Feature | AWS Glue | Apache Airflow |
| --- | --- | --- |
| Focus | All-in-one solution for everything related to data integration | Workflow management platform for orchestrating data pipelines |
| Customisability | A simple mapping-based generic solution, not very customisable | Code-based solution, highly customisable |
| Infrastructure | Serverless, managed service | Requires installation on user-managed servers, though managed solutions exist for seamless deployment |
| Licensing model | Paid cloud-managed service | Open source, with managed options available as well (MWAA) |
| Monitoring and logging | Natively integrates with AWS CloudWatch | Requires separate configuration for monitoring and logging; can be connected to CloudWatch |
| Degree of flexibility | Supports only the Spark framework for implementing transformation tasks | Supports more execution frameworks, since Airflow is a task-facilitation framework |
It was a challenging decision between the two options. Ultimately, we chose Apache Airflow due to its open-source nature, which helped us avoid vendor lock-in.
What Is Apache Airflow?
Originally developed at Airbnb, Apache Airflow is a batch workflow orchestration platform where each job is represented as a Directed Acyclic Graph (DAG).
The Airflow framework includes operators that connect with various technologies and is easily extensible for connecting with new technologies. If workflows have a clear start and end and run at regular intervals, they can be effectively programmed as an Airflow DAG.
Understanding Apache Airflow Architecture
Apache Airflow, or ‘CRON on steroids’ as some like to refer to it, operates on the fundamental principle of DAGs. In this framework, functions or methods act as the nodes of the DAG, and the dependencies between them are defined using the simple arrow notation, ‘>>’.
Airflow also offers a variety of provider packages that allow for easy integration with third-party projects, including AWS, Microsoft Azure, Facebook, Google, and more. The full list is available in the Airflow providers documentation.
Streamlined Configurability and Observability in Apache Airflow
Apache Airflow comes with a powerful dashboard that allows us to easily observe and configure scheduled times, retry logic, and enable or disable jobs. It also provides a wealth of metrics for monitoring job performance and error rates. Additionally, we can implement custom logging for individual jobs, making troubleshooting straightforward.
Implementation of Version 1 With AWS MWAA
We chose to implement our version 1 using Managed Workflows for Apache Airflow (MWAA) from AWS while retaining the flexibility to go completely self-hosted in the future.
Using MWAA allowed us to maintain an ops-lite approach for version 1. We used the ECSOperator to run tasks as ephemeral ECS tasks outside of Airflow’s network. Traditionally, regular tasks would run on Airflow’s worker nodes. However, for large jobs that require additional resources, this could lead to increased processor demand and slower execution times.
In such scenarios, the ECSOperator proves to be an ideal solution.
With the groundwork laid for implementation, we turned towards the essential components for data management: Connectors and Converters.
Connectors and Converters
For each data feed, we developed a DAG that executes two key tasks: a Connector and a Converter.
– Connectors: These contain the crawler logic that retrieves raw data from various sources, including websites, APIs, shared files, RSS feeds, and Twitter, and then saves this data to S3.
– Converters: After the data is stored in S3, the Converters download it, perform cleaning and transformation into STIX format, and feed the processed data into a queue. The process of converting raw data into a structured format often requires multiple iterations to stabilise.
Initially, you start with a few data feeds and design your conversion logic based on their specific formats. As you add more feeds, you encounter new formats, prompting you to iteratively expand your conversion logic.
To streamline this process, we decided to decouple the connectors from the converters. This allowed us to rerun the conversion logic after adding support for previously unhandled scenarios without the need to re-download the data.
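The decoupling can be sketched as follows. This is a simplified, stdlib-only illustration: local files stand in for S3, and the feed name and output fields are hypothetical rather than our production STIX schema:

```python
import json
import tempfile
from pathlib import Path


def connector(raw_dir: Path) -> Path:
    """Download raw data once and persist it (S3 in production, local disk here)."""
    raw = {"title": "Example advisory", "source": "example_feed"}
    path = raw_dir / "example_feed.json"
    path.write_text(json.dumps(raw))
    return path


def converter(raw_path: Path) -> dict:
    """Clean and transform previously stored raw data.

    Because the raw payload is already persisted, this step can be rerun
    after the conversion logic changes, without re-downloading anything.
    """
    raw = json.loads(raw_path.read_text())
    return {"type": "report", "name": raw["title"], "x_source": raw["source"]}


raw_dir = Path(tempfile.mkdtemp())
stored = connector(raw_dir)   # runs once per schedule
first = converter(stored)     # initial conversion
second = converter(stored)    # rerun after a logic fix: no re-download needed
```

The key property is that the converter only ever reads from storage, so replaying it is cheap and safe.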
Raw Data Storage
We used AWS S3 buckets to store raw data collected from various sources. The Connector task saved this data directly to the designated S3 bucket.
Processed Data Queue
The queue carries structured, cleaned, and valuable information that can be saved into the database. The converter sends the data into the queue for further processing by the Data Injector. We used the AWS SQS service to handle the data flow reliably.
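In broad strokes, the converter's hand-off to the queue looks like the sketch below. The envelope fields and dedup key are our illustration, not the exact production schema, and a local list stands in for SQS; in production the append would be a `boto3` `send_message` call:

```python
import hashlib
import json


def to_message(stix_object: dict) -> str:
    """Wrap a converted object in a queue message body with a stable dedup key."""
    canonical = json.dumps(stix_object, sort_keys=True)
    dedup_id = hashlib.sha256(canonical.encode()).hexdigest()
    return json.dumps({"dedup_id": dedup_id, "payload": stix_object})


# Local stand-in for SQS; in production: sqs.send_message(QueueUrl=..., MessageBody=...)
queue = []
queue.append(to_message({"type": "indicator", "name": "example"}))
```

Hashing the canonicalised payload gives downstream consumers a cheap way to spot duplicates before touching the database.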
Data Injector
As a consumer of the queue, the Data Injector de-duplicates, ranks, computes the confidence index, and inserts data into the database. We implemented an AWS Lambda function with an SQS trigger, enabling automatic processing whenever new data enters the queue.
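The injector can be sketched as a Lambda-style handler. This is a hedged illustration: the confidence scoring is a placeholder, and an in-memory dict stands in for the MongoDB collection, but the SQS event shape (`{"Records": [{"body": ...}]}`) is the standard one Lambda receives from an SQS trigger:

```python
import json

SEEN = {}  # stands in for the MongoDB collection, keyed for de-duplication


def confidence_index(record: dict) -> float:
    """Placeholder scoring; production combines source reputation and ranking signals."""
    return 0.5 if record.get("x_source") else 0.1


def handler(event, context=None):
    """SQS-triggered Lambda entry point: de-duplicate, score, and insert records."""
    inserted = 0
    for sqs_record in event["Records"]:       # standard SQS event shape
        record = json.loads(sqs_record["body"])
        key = record.get("name")
        if key in SEEN:                       # de-duplication
            continue
        record["confidence"] = confidence_index(record)
        SEEN[key] = record                    # insert (MongoDB in production)
        inserted += 1
    return {"inserted": inserted}
```

With the SQS trigger configured, Lambda invokes `handler` automatically for each batch of messages, so no polling code is needed.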
Database
For storing the final structured data, we chose MongoDB. Given the non-relational nature of our data and the need for high availability, a NoSQL database was the ideal solution. MongoDB provided the flexibility and scalability we needed, making it an obvious choice.
Addressing the Setup Challenges With MWAA
One of the main challenges we faced was that MWAA supports only specific versions of Apache Airflow and ships with a fixed set of out-of-the-box plugins, which proved difficult to replicate manually.
Common issues included dependency version mismatches and Python version incompatibilities. This made it particularly challenging for new developers to get familiar with the system and set up a new environment just to write a few DAGs.
After some research, we discovered a solution that allowed us to mimic an MWAA environment locally — the aws-mwaa-local-runner repository.
This repository provides a Command Line Interface (CLI) utility that replicates an Amazon Managed Workflows for Apache Airflow (MWAA) setup on a local machine. It includes:
– A Dockerfile derived from an Amazon Linux image that replicates the MWAA server.
– A command-line shell script to build an Airflow image with predefined dependencies and constraints.
– A Docker Compose file with a local PostgreSQL instance and the locally built Airflow Docker image. It also mounts the local DAGs directory, enabling us to test and run DAGs locally.
To run the Airflow instance locally, we follow these two commands:
1. `./mwaa-local-env build-image`: This command generates a Docker image using the Dockerfile.
2. `./mwaa-local-env start`: This command starts a local environment with a PostgreSQL database and a locally mounted DAGs folder.
This approach also gave us the flexibility to containerise our setup, making it easy to migrate to a self-hosted Airflow setup with minimal configuration changes whenever required.
Optimising for Code Reusability With a Smart Repository Structure
Most data sources followed similar patterns for querying, processing, and storing data. To ensure code reusability and maintain consistent standards across data feeds, we developed a smart codebase structure.
We used a monorepo structure for the Connectors/Converters and the Data Injector.
Connectors and Converters run as Airflow tasks, with all logic shared between data feeds residing in a central shared folder.
```
|- common/
|    |---> A common pip package containing code shared between multiple services
|
|- data_injector/
|    |---> Code for the serverless function that picks STIX entries from SQS and inserts them into MongoDB
|
|- connectors_and_converters/
|    |---> Airflow's DAGs folder; contains code for the data feed DAGs
|    |
|    |- shared/
|    |    |---> Contains code common to all DAGs
|    |    |- utilities
|    |    |- providers
|    |
|    |- feeds/
|         |- data_feed_1/
|         |    |- __init__.py
|         |    |- connector.py
|         |    |- convertor.py
|         |- data_feed_2/
|         |    |- __init__.py
|         |    |- connector.py
|         |    |- convertor.py
|         |- data_feed_3/
|              |- __init__.py
|              |- connector.py
|              |- convertor.py
|
|- aws/
|    |---> All CloudFormation scripts required to bring up the infra on AWS
|
|- docker/
     |---> Docker Compose files to run the local setup
```
As we implemented more data feeds and identified common patterns, we promoted new, reusable logic to the shared folder. We started with a few core utilities for scraping and uploading data to S3, and as the number of feeds grew, our shared logic expanded accordingly.
Efficient Deployment With MWAA
Since we decided to use MWAA, all DAG code needed to be deployed to an S3 bucket. We only had to write a GitHub Action that uses the AWS CLI to sync the connectors_and_converters directory to AWS S3.
```shell
aws s3 sync ./connectors_and_converters/ s3://${{secrets.AWS_S3_BUCKET_DAGS}}/dags/ --delete --exclude "*.pyc"
```
Final Thoughts
Apache Airflow helped us come up with a manageable and scalable solution to run a large volume of data feeds. Thanks to its flexibility, we were able to add two new feeds every week. At this rate, we are confident in scaling up to ten new feeds per week by the end of a quarter.
If you’re looking for a solution to optimise your data workflows, schedule a free consultation call with us to discuss how we can achieve similar results for you.
Resources
Here are some useful resources to learn more about the core concepts discussed in this article:
– Apache Airflow Docs: https://airflow.apache.org/docs/apache-airflow/stable/index.html
– Getting Started With MWAA: https://docs.aws.amazon.com/mwaa/latest/userguide/get-started.html
– Writing and Running DAGs: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#best-practices
– Logging and Monitoring in Airflow: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/index.html
– Using the ECSOperator to Connect to Amazon ECS: https://docs.aws.amazon.com/mwaa/latest/userguide/samples-ecs-operator.html#samples-ecs-operator-code
– Working With Airflow Docker in Amazon MWAA: https://medium.com/@sohflp/how-to-work-with-airflow-docker-operator-in-amazon-mwaa-5c6b7ad36976
FAQs
What is Apache Airflow and how does it work?
Apache Airflow is an open-source batch workflow management platform that allows users to programmatically author, schedule, and monitor workflows. It represents jobs as Directed Acyclic Graphs (DAGs), enabling complex task dependencies and scheduling at scale.
Why choose Apache Airflow over AWS Glue for data orchestration?
Owing to its open-source nature, Apache Airflow provides flexibility and avoids vendor lock-in. Additionally, Airflow’s capabilities for complex scheduling, horizontal scalability, real-time monitoring, and comprehensive observability make it a better fit for managing a large volume of data feeds.
How does Apache Airflow handle error monitoring and job performance?
Apache Airflow comes with a robust dashboard that allows users to observe scheduled jobs, monitor performance metrics, and track error rates. Custom logging for individual jobs simplifies troubleshooting and provides insights into job execution.
Can Apache Airflow be deployed on AWS, and what benefits does that provide?
Yes, Apache Airflow can be deployed on AWS using Managed Workflows for Apache Airflow (MWAA). This deployment option offers a streamlined, ops-lite approach while retaining the flexibility to transition to a self-hosted environment when needed.