Modern data engineering is built around automation, orchestration, and scalable infrastructure. In this project, I built a Dockerized ETL pipeline that collects crypto data from the CoinPaprika API, processes it using Apache Airflow, and stores it in a PostgreSQL database hosted on Aiven Cloud.
The goal of the project is to simulate a real-world data engineering workflow while learning:
- Workflow orchestration with Apache Airflow
- Containerization with Docker
- API data extraction
- Cloud database integration (Aiven)
The entire pipeline runs inside Docker containers, making it portable, reproducible, and easy to deploy.
Architecture Overview
The project has four main components:
- CoinPaprika API: an API providing real-time cryptocurrency market data
- Apache Airflow: to orchestrate the ETL workflow using TaskFlow API
- Docker: containerizes the entire Airflow environment for reproducibility
- Aiven PostgreSQL: a cloud PostgreSQL instance where processed data is stored
It follows a simple ETL workflow:
- Extract cryptocurrency market data from the CoinPaprika API
- Transform and clean the data
- Load the processed data into PostgreSQL
- Schedule and orchestrate everything using Airflow
Project Structure
crypto-etl-project/
│
├── dags/
│ └── crypto_etl_dag.py
│
├── scripts/
│ └── requirements.txt
│
├── .env
├── docker-compose.yml
├── Dockerfile
└── README.md
Docker Overview
Docker is an open-source platform designed to build, deploy, and manage applications inside lightweight, isolated containers. Using Docker ensures that the application behaves consistently across environments.
Why Docker Is Important
- Consistency: Eliminates the "it works on my machine" problem by standardizing environments.
- Efficiency: Containers are lightweight and share the host system's kernel, making them more efficient than virtual machines.
- Portability: Applications can be easily moved between local machines, cloud providers, and data centers.
- Security: Provides isolated environments for running applications and managing dependencies securely.
Setting Up the Docker Environment
Dockerfile
A Dockerfile contains the instructions used to build the Docker image.
The Dockerfile extends the official Apache Airflow image and installs the Python dependencies required for the ETL pipeline.
FROM apache/airflow:2.8.1
USER airflow
COPY scripts/requirements.txt .
RUN pip install --no-cache-dir --progress-bar off -r requirements.txt
RUN sed -i 's/self.timer.start()/#self.timer.start()/' \
/home/airflow/.local/lib/python3.8/site-packages/limits/storage/me>
What this does
- Uses the official Airflow image as the base
- Copies project dependencies into the container
- Installs Python packages
- Applies a workaround for a rate-limiting issue inside an Airflow dependency
docker-compose.yml
The docker-compose.yml file defines and manages multi-container applications. In this project, it configures and runs the Airflow service.
services:
  airflow:
    build: .
    container_name: crypto_airflow
    ports:
      - "8080:8080"
    env_file:
      - .env
    environment:
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
      AIRFLOW_HOME: /opt/airflow
      AIRFLOW_USER_HOME_DIR: /home/airflow
      PYTHONPATH: "/home/airflow/.local/lib/python3.8/site-packages"
      PATH: "/home/airflow/.local/bin:/usr/local/bin:/usr/bin:/bin"
      AIRFLOW__WEBSERVER__RATELIMIT_ENABLED: "false"
      AIRFLOW__WEBSERVER__WORKERS: "1"
      AIRFLOW__WEBSERVER__WORKER_TIMEOUT: "300"
      OPENBLAS_NUM_THREADS: "1"
      OMP_NUM_THREADS: "1"
    volumes:
      - ./dags:/opt/airflow/dags
    entrypoint: ["/bin/bash", "-c"]
    command:
      ["ulimit -u unlimited && exec /usr/local/bin/python3.8 -m ..."]
    privileged: true
Key Configurations
Port Mapping
This exposes the Airflow web UI on the local machine:
ports:
  - "8080:8080"
Volume Mounting
volumes:
  - ./dags:/opt/airflow/dags
This allows Airflow to automatically detect DAG changes from the local machine.
Environment Variables
The configuration disables unnecessary example DAGs and optimizes worker settings.
Building the ETL Pipeline
The core project logic lives inside the Airflow DAG, which automates the extract, transform, and load tasks.
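The DAG code itself isn't shown in this post, so here is a minimal, hedged TaskFlow sketch of what dags/crypto_etl_dag.py could look like. The dag_id, schedule, and task bodies are illustrative assumptions; only the overall shape (three chained tasks) mirrors the pipeline described below.

```python
# Hedged skeleton of a TaskFlow-style DAG; task bodies are elided.
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    dag_id="crypto_etl",          # assumed name
    schedule="@hourly",           # assumed schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def crypto_etl():
    @task
    def extract() -> list:
        ...  # call the CoinPaprika API and return raw ticker JSON

    @task
    def transform(raw: list) -> list:
        ...  # select fields, rename columns, handle missing values

    @task
    def load(rows: list) -> None:
        ...  # insert the cleaned rows into Aiven PostgreSQL

    load(transform(extract()))

crypto_etl()
```

With TaskFlow, the function-call chain at the bottom is what defines the task dependencies; Airflow passes the return values between tasks via XCom.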
Extract
The pipeline fetches cryptocurrency data from the CoinPaprika API.
Typical data collected includes:
- Coin name
- Symbol
- Price
- Market capitalization
- Trading volume
- Timestamp
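As a concrete sketch of the extract step, the snippet below fetches tickers using only the standard library. The endpoint and query parameter follow CoinPaprika's documented /v1/tickers API, but treat the exact URL and limit as assumptions to verify against the current docs.

```python
# Minimal extract sketch using only the standard library.
# The endpoint is CoinPaprika's public tickers API (no key required).
import json
import urllib.request

API_URL = "https://api.coinpaprika.com/v1/tickers?limit=10"

def extract_tickers(url: str = API_URL) -> list:
    """Fetch raw ticker records as a list of dictionaries."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In the real DAG this would run inside the extract task, with retries handled by Airflow rather than in the function itself.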
Transform
The raw JSON response is cleaned and structured using Python. The transformations include:
- Selecting relevant fields
- Renaming columns
- Handling missing values
- Converting timestamps
- Formatting numeric values

This stage ensures the data is analytics-ready before storage.
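The cleaning steps above can be sketched as a pure function. The input shape mirrors a CoinPaprika ticker record (name, symbol, and a quotes.USD block); the output column names are illustrative assumptions, not the project's actual schema.

```python
# Transform sketch: select, rename, handle missing values, format numbers,
# and attach a timestamp. Field names assume CoinPaprika's ticker shape.
from datetime import datetime, timezone

def transform_ticker(raw: dict) -> dict:
    usd = raw.get("quotes", {}).get("USD", {})
    return {
        "name": raw.get("name"),
        "symbol": raw.get("symbol"),
        "price_usd": round(float(usd.get("price") or 0.0), 8),
        "market_cap_usd": int(usd.get("market_cap") or 0),
        "volume_24h_usd": int(usd.get("volume_24h") or 0),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the transform free of I/O makes it trivial to unit-test outside Airflow.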
Load
The cleaned data is inserted into a PostgreSQL database hosted on Aiven Cloud.
Using a managed cloud database removes the burden of infrastructure maintenance while simulating a production-ready setup.
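A hedged sketch of the load step is shown below, using psycopg2 (any PostgreSQL driver would work). The connection reads the AIVEN_* variables from the project's .env file; the crypto_prices table and its columns are illustrative assumptions.

```python
# Load sketch: bulk-insert cleaned rows into Aiven PostgreSQL.
# Table/column names are hypothetical; adjust to the real schema.
import os

INSERT_SQL = """
    INSERT INTO crypto_prices (name, symbol, price_usd, market_cap_usd,
                               volume_24h_usd, fetched_at)
    VALUES (%(name)s, %(symbol)s, %(price_usd)s, %(market_cap_usd)s,
            %(volume_24h_usd)s, %(fetched_at)s)
"""

def load_rows(rows: list) -> None:
    import psycopg2  # imported lazily so the module loads without the driver

    conn = psycopg2.connect(
        host=os.environ["AIVEN_HOST"],
        port=os.environ["AIVEN_PORT"],
        dbname=os.environ["AIVEN_DB"],
        user=os.environ["AIVEN_USER"],
        password=os.environ["AIVEN_PASSWORD"],
        sslmode="require",  # Aiven enforces SSL by default
    )
    with conn, conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    conn.close()
```

Passing sslmode="require" matters here: Aiven rejects plain-text connections, which is one of the connectivity pitfalls mentioned later.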
Aiven PostgreSQL
Aiven provides managed PostgreSQL with SSL enabled by default. Credentials are stored in a .env file:
AIVEN_USER=avnadmin
AIVEN_PASSWORD=<your_password>
AIVEN_HOST=<your_host>.aivencloud.com
AIVEN_PORT=<your_port>
AIVEN_DB=defaultdb
Running the Pipeline
Step 1: Build the Containers
docker compose build
Step 2: Start the Services
docker compose up
Step 3: Access Airflow
http://localhost:8080
Then trigger the DAG from the web UI and monitor its runs.
Challenges Faced
Most of the challenges I faced were during the Docker setup.
Rate Limiting Issues
A rate-limiting issue inside one of the Airflow dependencies required patching part of the package using:
sed -i 's/self.timer.start()/#self.timer.start()/'
Thread Creation Errors
Thread creation errors during pip installs were fixed by adding --progress-bar off to pip and running the container with privileged: true.
Cloud Database Connectivity
Connecting Docker containers to a managed PostgreSQL instance required careful handling of:
- SSL configurations
- Environment variables
- Network permissions
Conclusion
This project provided practical exposure to several important data engineering concepts:
- Designing automated ETL pipelines
- Working with orchestrators like Airflow
- Containerizing applications with Docker
- Integrating APIs into data workflows
- Using managed cloud databases
- Structuring production-style projects
It also demonstrated how modern data platforms combine orchestration, infrastructure, and automation into scalable systems.
GitHub Repository: Crypto_etl