Goal: Make Research's reusable ML tooling easy to consume in ml-pipelines, reducing duplication and easing migrations, without introducing heavy frameworks.
Deliverables:
- Shared code project in ml-pipelines (name TBD)
- Own deps + CI (tests incl. Spark snapshot, lint, build, publish)
- Publish a wheel to GitLab PyPI for ML projects and Research
- Seed from research-datasets (research_transformation, etc.); replace in-tree copies
- Pin dependencies for offline workloads (e.g., Spark, Iceberg) while keeping the package notebook-friendly
- Utilities only; no framework base classes
- Command API (opt-in)
- Generate typed dataclasses from entry-point args; use in Airflow tasks
- Better defaults, type safety, IDE hints; easier usage in Jupyter
- Pilot in one project before broader adoption
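As a rough illustration of the opt-in Command API bullet, the sketch below derives a CLI parser from a dataclass's typed fields and returns a typed instance, so the same object works for an Airflow task's entry point and for direct construction in Jupyter. All names here (`TrainArgs`, `from_argv`) are hypothetical, not the final API.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class TrainArgs:
    """Hypothetical entry-point args for one task."""
    input_table: str
    snapshot: str
    learning_rate: float = 0.01  # typed default surfaces in IDE hints


def from_argv(cls, argv=None):
    """Build an argparse parser from the dataclass fields and return a typed instance."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        flag = "--" + f.name.replace("_", "-")
        if f.default is MISSING:
            parser.add_argument(flag, type=f.type, required=True)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    ns = parser.parse_args(argv)
    # argparse normalizes "--input-table" to the dest "input_table",
    # so the namespace maps 1:1 onto the dataclass fields.
    return cls(**vars(ns))


# Entry point / Airflow task:
#   args = from_argv(TrainArgs, ["--input-table", "events", "--snapshot", "2024-01"])
# Jupyter, bypassing argv entirely:
#   args = TrainArgs(input_table="events", snapshot="2024-01")
```

Because this is a plain function over plain dataclasses, it stays a utility (no framework base classes) and projects can opt in per entry point.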
Affected repos:
- Migrate shared code from research-datasets → new shared project in ml-pipelines
- Update dependent projects to import the published wheel
- Minimal DAG updates in airflow-dags for pilot if needed
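For the "import the published wheel" step, dependent projects could point pip at the shared project's GitLab PyPI registry via a requirements entry. The URL follows GitLab's standard project-level PyPI index pattern; the package name and project id below are placeholders until the project is named and created.

```
# requirements.txt fragment (placeholder name and <project-id>)
--index-url https://gitlab.example.org/api/v4/projects/<project-id>/packages/pypi/simple
research-ml-utils==0.1.0
```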
Out of Scope:
- Migrating specific pipelines (revert-risk, add-a-link, inference, etc.)
- Mandating Command API across all pipelines
Acceptance:
- Shared project created; CI green; wheel published; brief usage docs
- At least one ML project consumes the package
- Command API used in one pilot