Problem
Each data source in the project currently follows a similar workflow of fetching, processing, and report generation. However, these steps are run as separate standalone scripts, which makes the workflow hard to automate, monitor, and maintain, especially as the number of data sources grows.
Description
I propose using Apache Airflow to orchestrate and automate the data workflows.
Each data source can be represented as an Airflow DAG, with a task for each stage:
- Fetch: Collect data from APIs or other external sources.
- Process: Clean, transform, and aggregate the data.
- Report: Generate summaries or metrics for analysis.
This approach would improve visibility, scheduling, error handling, and reusability across all data pipelines; a rough sketch of one such DAG is shown below.
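As a rough illustration only, here is a minimal sketch of what one per-source DAG could look like, assuming Airflow 2.4+ and the TaskFlow API. The module path `sources.example_api` and the function names `fetch_data`, `process_data`, and `generate_report` are hypothetical placeholders for the project's existing scripts.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_source_pipeline",
    schedule="@daily",            # whatever cadence fits the source
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["data-source"],
)
def example_source_pipeline():
    @task
    def fetch() -> str:
        # Hypothetical wrapper around the existing fetch script;
        # returns a path to the raw data it produced.
        from sources.example_api import fetch_data  # assumed module
        return fetch_data()

    @task
    def process(raw_path: str) -> str:
        # Hypothetical wrapper around the existing processing step.
        from sources.example_api import process_data  # assumed module
        return process_data(raw_path)

    @task
    def report(processed_path: str) -> None:
        # Hypothetical wrapper around the existing report generation.
        from sources.example_api import generate_report  # assumed module
        generate_report(processed_path)

    # Chaining the calls defines fetch -> process -> report ordering.
    report(process(fetch()))


example_source_pipeline()
```

Each data source would get its own DAG following this pattern, so scheduling, retries, and run history come from Airflow rather than from ad-hoc logic inside each script.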
Alternatives
- Continue using manual script execution.
- Implement lightweight scheduling within the Python scripts themselves (sketched below).
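For comparison, a minimal sketch of the in-script scheduling alternative using only the standard library; `run_pipeline()` is a hypothetical placeholder for the existing fetch, process, and report calls.

```python
import logging
import time

INTERVAL_SECONDS = 24 * 60 * 60  # e.g. run once a day


def run_pipeline() -> None:
    # Placeholder for the existing fetch -> process -> report sequence.
    ...


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    while True:
        try:
            run_pipeline()
        except Exception:
            # Retries, alerting, and backfills would all have to be
            # implemented by hand in this approach.
            logging.exception("Pipeline run failed")
        time.sleep(INTERVAL_SECONDS)
```

This keeps dependencies minimal, but monitoring, retries, and backfills would have to be built and maintained by hand, which is the main gap Airflow would close.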
Additional context
This integration aligns with the existing three-phase workflow structure and can easily wrap around current scripts without major codebase refactoring. Airflow also supports modular task development, which would benefit future contributors.
Implementation
- I would be interested in implementing this feature.