Automate data pipeline orchestration using Apache Airflow #182

@Joyakis


Problem

Each data source in the project currently follows a similar workflow of fetching, processing, and report generation. However, these steps run as separate, manually executed scripts, which makes the process hard to automate, monitor, and maintain, especially as the number of data sources grows.

Description

I propose using Apache Airflow to orchestrate and automate the data workflows.
Each data source can be represented as an Airflow DAG, with tasks corresponding to each stage:

  • Fetch: Collect data from APIs or external sources.

  • Process: Clean, transform, and aggregate the data.

  • Report: Generate summaries or metrics for analysis.

This approach would improve visibility, scheduling, error handling, and reusability across all data pipelines.
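As a rough sketch of what one such DAG could look like (assuming Airflow 2.4+; `example_source_pipeline`, `fetch_data`, `process_data`, and `generate_report` are illustrative placeholders, not existing project code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_data():
    """Collect raw data from the source's API (placeholder)."""


def process_data():
    """Clean, transform, and aggregate the fetched data (placeholder)."""


def generate_report():
    """Generate summary metrics for analysis (placeholder)."""


# One DAG per data source; "example_source_pipeline" is a hypothetical name.
with DAG(
    dag_id="example_source_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch", python_callable=fetch_data)
    process = PythonOperator(task_id="process", python_callable=process_data)
    report = PythonOperator(task_id="report", python_callable=generate_report)

    # Enforce the fetch -> process -> report ordering.
    fetch >> process >> report
```

Retries, alerting, and scheduling would then be configured per task or per DAG in one place rather than re-implemented inside each script.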

Alternatives

  • Continue using manual script execution.

  • Implement lightweight scheduling within Python scripts.

Additional context

This integration aligns with the existing three-phase workflow structure and can easily wrap around current scripts without major codebase refactoring. Airflow also supports modular task development, which would benefit future contributors.
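For example, each phase could initially be wrapped with a `BashOperator` so the current scripts run unchanged. This is only a sketch: the script paths and DAG id below are hypothetical and would be replaced with the repo's actual entry points.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Wraps the existing scripts as-is; the paths below are placeholders.
with DAG(
    dag_id="example_source_wrapped",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch = BashOperator(task_id="fetch", bash_command="python scripts/fetch.py")
    process = BashOperator(task_id="process", bash_command="python scripts/process.py")
    report = BashOperator(task_id="report", bash_command="python scripts/report.py")

    fetch >> process >> report
```

Individual tasks could later be migrated to `PythonOperator` or the TaskFlow API incrementally, as and when the underlying scripts are modularized.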

Implementation

  • I would be interested in implementing this feature.
