Karen Langat

Building a Dockerized Cryptocurrency ETL Pipeline with Apache Airflow

Modern data engineering is built around automation, orchestration, and scalable infrastructure. In this project, I built a Dockerized ETL pipeline that collects crypto data from the CoinPaprika API, processes it using Apache Airflow, and stores it in a PostgreSQL database hosted on Aiven Cloud.

The goal of the project is to simulate a real-world data engineering workflow while learning:

  • Workflow orchestration with Apache Airflow
  • Containerization with Docker
  • API data extraction
  • Cloud database integration (Aiven)

The entire pipeline runs inside Docker containers, making it portable, reproducible, and easy to deploy.

Architecture Overview

The project has four main components:

  • CoinPaprika API: provides real-time cryptocurrency market data
  • Apache Airflow: orchestrates the ETL workflow using the TaskFlow API
  • Docker: containerizes the entire Airflow environment for reproducibility
  • Aiven PostgreSQL: a managed cloud PostgreSQL instance where the processed data is stored

It follows a simple ETL workflow:

  • Extract cryptocurrency market data from the CoinPaprika API
  • Transform and clean the data
  • Load the processed data into PostgreSQL
  • Schedule and orchestrate everything using Airflow

Project Structure

crypto-etl-project/
├── dags/
│   └── crypto_etl_dag.py
├── scripts/
│   └── requirements.txt
├── .env
├── docker-compose.yml
├── Dockerfile
└── README.md

Docker Overview

Docker is an open-source platform designed to build, deploy, and manage applications inside lightweight, isolated containers. Using Docker ensures that the application behaves consistently across environments.

Why Docker Is Important

  • Consistency: Eliminates the "it works on my machine" problem by standardizing environments.
  • Efficiency: Containers are lightweight and share the host system's kernel, making them more efficient than virtual machines.
  • Portability: Applications can be easily moved between local machines, cloud providers, and data centers.
  • Security: Provides isolated environments for running applications and managing dependencies.

Setting Up the Docker Environment

Dockerfile

A Dockerfile contains the instructions used to build the Docker image.
The Dockerfile extends the official Apache Airflow image and installs the Python dependencies required for the ETL pipeline.

FROM apache/airflow:2.8.1
USER airflow

COPY scripts/requirements.txt .

RUN pip install --no-cache-dir --progress-bar off -r requirements.txt

RUN sed -i 's/self.timer.start()/#self.timer.start()/' \
    /home/airflow/.local/lib/python3.8/site-packages/limits/storage/me>

What this does

  • Uses the official Airflow image as the base
  • Copies project dependencies into the container
  • Installs Python packages
  • Applies a workaround for a rate-limiting issue inside an Airflow dependency
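
For reference, here is an illustrative scripts/requirements.txt for a pipeline like this one; the actual file may list different packages or pinned versions:

requests          # call the CoinPaprika REST API
psycopg2-binary   # connect to the Aiven PostgreSQL instance
pandas            # optional: tabular transformations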

docker-compose.yml

The docker-compose.yml file defines and manages multi-container applications. In this project, it configures and runs the Airflow service.

services:
  airflow:
    build: .
    container_name: crypto_airflow
    ports:
      - "8080:8080"

    env_file:
      - .env

    environment:
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
      AIRFLOW_HOME: /opt/airflow
      AIRFLOW_USER_HOME_DIR: /home/airflow
      PYTHONPATH: "/home/airflow/.local/lib/python3.8/site-packages"
      PATH: "/home/airflow/.local/bin:/usr/local/bin:/usr/bin:/bin"
      AIRFLOW__WEBSERVER__RATELIMIT_ENABLED: "false"
      AIRFLOW__WEBSERVER__WORKERS: "1"
      AIRFLOW__WEBSERVER__WORKER_TIMEOUT: "300"
      OPENBLAS_NUM_THREADS: "1"
      OMP_NUM_THREADS: "1"

    volumes:
      - ./dags:/opt/airflow/dags

    entrypoint: ["/bin/bash", "-c"]

    command:
      ["ulimit -u unlimited && exec /usr/local/bin/python3.8 -m ..."]

    privileged: true

Key Configurations

Port Mapping

This maps container port 8080 to the host, exposing the Airflow web UI locally.

ports:
  - "8080:8080"

Volume Mounting

volumes:
  - ./dags:/opt/airflow/dags

This allows Airflow to automatically detect DAG changes from the local machine.

Environment Variables

The configuration disables unnecessary example DAGs and optimizes worker settings.

Building the ETL Pipeline

The core project logic lives inside the Airflow DAG. The DAG automates the ETL tasks.
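
As a rough sketch of how the DAG is wired together with the TaskFlow API (the schedule, task names, and signatures here are illustrative; the real implementation lives in dags/crypto_etl_dag.py):

# Illustrative TaskFlow-style skeleton of the ETL DAG, not the exact project code.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["crypto"])
def crypto_etl():
    @task()
    def extract():
        # Fetch ticker data from the CoinPaprika API (see "Extract" below)
        ...

    @task()
    def transform(raw):
        # Clean and reshape the raw JSON (see "Transform" below)
        ...

    @task()
    def load(records):
        # Insert the cleaned records into Aiven PostgreSQL (see "Load" below)
        ...

    load(transform(extract()))


crypto_etl()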

Extract

The pipeline fetches cryptocurrency data from the CoinPaprika API (a minimal extract sketch follows the list below).
Typical data collected includes:

  • Coin name
  • Symbol
  • Price
  • Market capitalization
  • Trading volume
  • Timestamp
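
A minimal extract sketch against the public /v1/tickers endpoint; the function name and the limit are assumptions for illustration:

# Illustrative extract step: pull ticker data from CoinPaprika.
import requests

COINPAPRIKA_TICKERS_URL = "https://api.coinpaprika.com/v1/tickers"

def extract_tickers(limit=50):
    # The tickers endpoint returns a JSON list with name, symbol,
    # quotes["USD"]["price" / "market_cap" / "volume_24h"], last_updated, ...
    response = requests.get(COINPAPRIKA_TICKERS_URL, timeout=30)
    response.raise_for_status()
    return response.json()[:limit]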

Transform

The raw JSON response is cleaned and structured using Python. The transformations include:

  • Selecting relevant fields
  • Renaming columns
  • Handling missing values
  • Converting timestamps
  • Formatting numeric values

This stage ensures the data is analytics-ready before storage; a minimal transform sketch follows below.
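
A minimal sketch of such a transform, assuming the tickers returned by the extract step above (the output column names are illustrative):

# Illustrative transform step: keep relevant fields, rename them, coerce types.
from datetime import datetime, timezone

def transform_tickers(raw_tickers):
    records = []
    for coin in raw_tickers:
        usd = (coin.get("quotes") or {}).get("USD") or {}
        records.append({
            "name": coin.get("name"),
            "symbol": coin.get("symbol"),
            "price_usd": round(float(usd.get("price") or 0.0), 8),
            "market_cap_usd": int(usd.get("market_cap") or 0),
            "volume_24h_usd": int(usd.get("volume_24h") or 0),
            "extracted_at": datetime.now(timezone.utc),  # extraction timestamp
        })
    return records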

Load

The cleaned data is inserted into a PostgreSQL database hosted on Aiven Cloud.
Using a managed cloud database removes the burden of infrastructure maintenance while simulating a production-ready setup.

Aiven PostgreSQL

Aiven provides managed PostgreSQL with SSL enabled by default. Credentials are stored in a .env file:

AIVEN_USER=avnadmin
AIVEN_PASSWORD=<your_password>
AIVEN_HOST=<your_host>.aivencloud.com
AIVEN_PORT=<your_port>
AIVEN_DB=defaultdb
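
With those credentials available as environment variables, a minimal load sketch might look like the following; the crypto_prices table name and the psycopg2 driver are assumptions, and sslmode="require" reflects Aiven's SSL-by-default setup:

# Illustrative load step: insert cleaned records into Aiven PostgreSQL.
import os

import psycopg2

def load_records(records):
    conn = psycopg2.connect(
        host=os.environ["AIVEN_HOST"],
        port=os.environ["AIVEN_PORT"],
        dbname=os.environ["AIVEN_DB"],
        user=os.environ["AIVEN_USER"],
        password=os.environ["AIVEN_PASSWORD"],
        sslmode="require",  # Aiven-managed PostgreSQL enforces SSL
    )
    with conn, conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO crypto_prices
                (name, symbol, price_usd, market_cap_usd, volume_24h_usd, extracted_at)
            VALUES
                (%(name)s, %(symbol)s, %(price_usd)s, %(market_cap_usd)s,
                 %(volume_24h_usd)s, %(extracted_at)s)
            """,
            records,
        )
    conn.close()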

Running the Pipeline

Step 1: Build the Containers

docker compose build

Step 2: Start the Services

docker compose up

Step 3: Access Airflow

Open http://localhost:8080 in a browser, then trigger the DAG and monitor its runs.

Challenges Faced

Most of the challenges I faced were during the Docker setup.

Rate Limiting Issues

A rate-limiting issue inside one of the Airflow dependencies required patching part of the package using:
sed -i 's/self.timer.start()/#self.timer.start()/'

Thread Creation Errors

Fixed by adding --progress-bar off to pip and running the container with privileged: true.

Cloud Database Connectivity

Connecting Docker containers to a managed PostgreSQL instance required careful handling of:

  • SSL configurations
  • Environment variables
  • Network permissions

Conclusion

This project provided practical exposure to several important data engineering concepts:

  • Designing automated ETL pipelines
  • Working with orchestrators like Airflow
  • Containerizing applications with Docker
  • Integrating APIs into data workflows
  • Using managed cloud databases
  • Structuring production-style projects

It also demonstrated how modern data platforms combine orchestration, infrastructure, and automation into scalable systems.

GitHub Repository: Crypto_etl
