Introduction
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring data pipelines. It was created at Airbnb in 2014 and later donated to the Apache Software Foundation, which now maintains it as an open-source project. Airflow uses Python to define workflows that are straightforward to schedule and monitor.
If you are wondering what makes Airflow so widely adopted, here are some key points:
Open-source: Airflow is open-source and boasts a large, active user community.
Easy to use: You can design flexible workflows using Python without needing to learn any additional technologies or frameworks.
GUI: A user-friendly graphical interface lets you monitor and manage workflows and check the status of ongoing and completed tasks.
So, now the question that pops up right away is...
Why do we need Airflow?
Here are some reasons why Apache Airflow is a valuable tool:
Orchestrates Complex Workflows: Airflow is adept at managing intricate data pipelines, and handling the movement and processing of data from multiple sources. It ensures tasks run in the correct sequence and that all dependencies are properly managed.
Integrations and Customization: Airflow provides a comprehensive range of features for integrating with various external systems and customizing its functionalities to meet your specific requirements. By utilizing its integration and customization capabilities, Airflow can transform into a powerful orchestration tool that seamlessly connects with your existing data ecosystem and automates complex workflows tailored to your specific needs.
Scheduling and Automation: With Airflow, you can set up schedules for your workflows, enabling them to run automatically at specified times. This automation eliminates the need for manual intervention and ensures tasks are completed reliably.
Monitoring and Visualization: Airflow provides a web-based interface for monitoring your workflows. This interface allows you to track the status of tasks, identify and address errors, and visualize the overall progress of your pipelines.
Flexibility: Built with Python, Airflow is highly adaptable to various data sources and tools. You can create custom operators to seamlessly integrate Airflow with virtually any system (see the sketch below this list).
Open Source: As an open-source tool, Airflow is free to use and benefits from a large community that provides support and continuous development.
Overall, Apache Airflow enhances the management and automation of complex workflows, significantly improving efficiency and reliability in data-driven tasks.
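To make the flexibility and orchestration points concrete, here is a minimal sketch of a custom operator wired into a small DAG. It assumes Airflow 2.x, and the operator class, DAG id, and task ids are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator

class GreetOperator(BaseOperator):
    # A hypothetical custom operator: the work a task does goes in execute().
    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        print(f"Hello, {self.name}!")

with DAG(
    dag_id="custom_operator_demo",   # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,                   # do not backfill missed runs
) as dag:
    first = GreetOperator(task_id="greet_first", name="Airflow")
    second = GreetOperator(task_id="greet_second", name="again")

    first >> second                  # greet_first must succeed before greet_second runs
The >> operator is how Airflow expresses the dependency ordering described above: the second task only runs after the first one succeeds.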
But wait: CI/CD tools provide similar functionality, and you can also write plain Python scripts to automate tasks. So the next question is how Airflow differs from these tools.
There is some overlap between Apache Airflow and CI/CD tools. Both involve automation, but they target different aspects of the software development lifecycle. Here’s a breakdown of the key differences:
CI/CD
Focus: Automates the build, testing, and deployment of software applications.
Typical Tasks:
Triggered by code changes in version control systems (VCS) like Git.
Builds the software from source code.
Runs automated tests to ensure code quality and functionality.
Deploys the software to different environments (staging, production).
Examples: Jenkins, GitLab CI/CD, CircleCI.
Apache Airflow
Focus: Orchestrates and schedules complex workflows, often involving data pipelines.
Typical Tasks:
Moving and processing data between various sources (databases, APIs, etc.).
Running data transformation or cleaning scripts.
Automating machine learning tasks.
Can integrate with CI/CD tools to trigger workflows after deployments (see the sketch after this list).
Note: Airflow is often used in conjunction with CI/CD tools, not as a replacement for them.
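As a hypothetical sketch of that integration, a CI/CD job could trigger a DAG run after a deployment through Airflow 2.x's stable REST API, assuming the API's basic-auth backend is enabled on the webserver; the URL, DAG id, credentials, and configuration payload below are placeholders.
import requests

AIRFLOW_URL = "http://localhost:8080"         # assumed webserver address
DAG_ID = "daily_task_dag"                     # the example DAG built later in this post

# Ask Airflow to start a new DAG run, optionally passing run configuration.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("your_username", "your_password"),  # user created with `airflow users create`
    json={"conf": {"deployed_version": "1.2.3"}},
)
response.raise_for_status()
print(response.json())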
Python Scripts
Focus: General-purpose scripting that can be used within both CI/CD and Airflow workflows, but at a more granular level.
Functionality:
Serve as individual building blocks within CI/CD or Airflow pipelines.
Handle specific tasks like running unit tests, deploying to a server, or cleaning data.
Analogy
CI/CD: Like an assembly line for building cars, automating the core stages (welding, painting, etc.).
Airflow: Like the logistics system that manages the flow of parts and materials to the assembly line, ensuring things arrive at the right time and in the right order.
Python Scripts: The individual tools used on the assembly line (wrenches, drills, etc.).
Summary
CI/CD focuses on the software development lifecycle, handling tasks such as building, testing, and deploying applications.
Airflow orchestrates complex workflows, often involving data, ensuring tasks are executed in the correct sequence and dependencies are met.
They can work together for a comprehensive automation strategy, with Python scripts serving as the detailed instructions within both systems.
Airflow Architecture
Airflow's architecture is centered on Directed Acyclic Graphs (DAGs) and utilizes independent services for distributed execution.
A basic DAG (Directed Acyclic Graph) captures the overall workflow that Airflow executes.
Key Components Breakdown:
Core Components:
DAGs (Directed Acyclic Graphs): These serve as the blueprints for workflows, defining tasks and their dependencies to specify the order of execution.
Metadata Database: Typically using PostgreSQL or MySQL, this database stores essential information about DAGs, tasks, configurations, and workflow execution history. All Airflow components coordinate through this database.
Web Interface:
- Webserver: This component offers a user interface for Airflow, enabling users to monitor DAG runs, view logs, manually trigger workflows, and manage configurations.
Workflow Execution:
Scheduler: Continuously monitors DAGs for scheduled tasks and triggers tasks for execution based on their dependencies as defined in the DAGs.
Executor: Manages task execution using external tools like Celery or Kubernetes. It receives tasks from the scheduler, queues them (if using Celery with a message broker), and distributes them to worker processes.
Worker Processes: Lightweight processes running on individual machines that execute the defined tasks and report their status back to the executor.
Optional Components (Distributed Systems):
- Message Broker (e.g., RabbitMQ, Redis): When using the Celery executor, this component serves as a central task queue. The scheduler sends tasks to the queue, which workers then pick up for execution.
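For example, switching to the distributed Celery setup is mostly a configuration change. Here is a minimal sketch of the relevant airflow.cfg entries, with illustrative broker and database addresses that you would replace with your own:
[core]
executor = CeleryExecutor
[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow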
Benefits of Distributed Architecture:
Scalability: Easily manage increasing workloads by adding more worker machines (see the command below this list).
Fault Tolerance: In the event of a machine failure, tasks can be reassigned to other workers, minimizing downtime.
Security: Different components can be isolated to apply distinct security measures to each part of the system.
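With a Celery setup like the sketch above, scaling out amounts to starting another worker process (Airflow 2.x CLI) on an additional machine pointed at the same broker and metadata database:
airflow celery worker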
Airflow Hands-on
To get started with Airflow, you first need to install it on your machine. Open your terminal and run the following commands:
pip install apache-airflow
airflow db init
To run the webserver and scheduler use these commands on separate terminals:
airflow webserver --port 8080
airflow scheduler
This should start the web server, which you can access at localhost:8080, where a login page appears. You can either run the airflow standalone command, which creates an admin user and prints its username and password, or create a user yourself.
To create a user yourself: on Airflow 2.x the web UI uses password authentication out of the box, while on older Airflow 1.x installations you first have to enable it in airflow.cfg:
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
Then create an admin user with your chosen username and password:
airflow users create -r Admin -u your_username -p your_password -e your_email@example.com -f First -l Last
Then restart the webserver by stopping the running process and starting it again:
airflow webserver --port 8080
Once you are logged in, you will see a page similar to this:
To see a simple demo of how Airflow works, let's write a small Python script and a DAG that runs the same task daily.
from datetime import datetime

def run_daily_task():
    current_time = datetime.now()
    print(f"Running daily task at {current_time}")

if __name__ == "__main__":
    run_daily_task()
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from airflow.operators.email import EmailOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 5, 13),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'daily_task_dag',
    default_args=default_args,
    description='A simple DAG that runs a daily task',
    schedule_interval=timedelta(days=1),
)

def run_daily_task():
    current_time = datetime.now()
    print(f"Running daily task at {current_time}")

run_daily_task_operator = PythonOperator(
    task_id='run_daily_task',
    python_callable=run_daily_task,
    dag=dag,
)

# The success notification is sent explicitly by this EmailOperator task.
email_notification = EmailOperator(
    task_id='email_notification',
    to='your-email-address',
    subject='Daily Task Completed Successfully',
    html_content='The daily task has been completed successfully.',
    dag=dag,
)

# Run the Python task first, then send the confirmation email.
run_daily_task_operator >> email_notification
The script above shows that the DAG runs the daily Python task and then sends a confirmation email to the specified address. Save the DAG file in your DAG folder (by default ~/airflow/dags) so the scheduler can pick it up. For the EmailOperator to work, you also need to make the following changes to your airflow.cfg file:
[smtp]
smtp_host = smtp.example.com
smtp_port = 587
smtp_starttls = True
smtp_ssl = False
smtp_user = your_smtp_username
smtp_password = your_smtp_password
smtp_mail_from = airflow@example.com
After configuring SMTP, restart the webserver and scheduler. Once they are running again, enable your DAG in the web interface.
Trigger the DAG and it should run successfully.
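If a run fails, a quick way to debug is to execute a single task outside the scheduler using the Airflow CLI; the date below is just an example logical date for the test run:
airflow tasks test daily_task_dag run_daily_task 2024-05-13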