Share code between Research & ML teams
Open, Medium, Public

Assigned To

Authored By

	fkaelin
	Jul 8 2025, 2:21 PM

Description

Goal: Make Research's reusable ML tooling easy to consume in ml-pipelines, reducing duplication and easing migrations, without introducing heavy frameworks.

Deliverables:

Shared code project in ml-pipelines (name TBD)
- Own deps + CI (tests incl. Spark snapshot, lint, build, publish)
- Publish a wheel to GitLab PyPI for ML projects and Research
- Seed from research-datasets (research_transformation, etc.); replace in-tree copies
- Pin offline workload deps (e.g., Spark/Iceberg); notebook-friendly usage
- Utilities only; no framework base classes
Command API (opt-in)
- Generate typed dataclasses from entry-point args; use in Airflow tasks
- Better defaults, type safety, IDE hints; easier usage in Jupyter
- Pilot in one project before broader adoption
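
A minimal sketch of the Command API idea, not a proposed implementation: it assumes an entry point is a plain Python function with annotated parameters and derives a dataclass from its signature. The names `run_training`, `command_from_entry_point`, and `to_cli_args` are hypothetical and exist only for illustration.

```python
import dataclasses
import inspect
from typing import Any, Callable


def run_training(wiki: str, snapshot: str = "2025-06", max_rows: int = 1000) -> str:
    """Hypothetical pipeline entry point, standing in for a real job."""
    return f"train wiki={wiki} snapshot={snapshot} max_rows={max_rows}"


def command_from_entry_point(name: str, entry_point: Callable[..., Any]) -> type:
    """Build a dataclass with one typed field per entry-point parameter."""
    required, optional = [], []
    for param in inspect.signature(entry_point).parameters.values():
        # Fall back to Any when the entry point has no annotation.
        annotation = Any if param.annotation is inspect.Parameter.empty else param.annotation
        if param.default is inspect.Parameter.empty:
            required.append((param.name, annotation))
        else:
            optional.append((param.name, annotation, dataclasses.field(default=param.default)))
    # Dataclasses require fields without defaults to come first.
    return dataclasses.make_dataclass(name, required + optional)


def to_cli_args(command: Any) -> list[str]:
    """Render a command instance as CLI arguments, e.g. for an Airflow task."""
    args: list[str] = []
    for field in dataclasses.fields(command):
        flag = "--" + field.name.replace("_", "-")
        args += [flag, str(getattr(command, field.name))]
    return args


TrainCommand = command_from_entry_point("TrainCommand", run_training)
cmd = TrainCommand(wiki="enwiki", max_rows=500)

cli_args = to_cli_args(cmd)                            # for a DAG task
result = run_training(**dataclasses.asdict(cmd))       # for a notebook call
```

Building the class at runtime catches misspelled or missing arguments when the command is constructed, but static IDE hints would probably need the dataclass source to be generated and checked in instead. Mutable defaults such as lists are not handled in this sketch.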

Affected repos:

Migrate shared code from research-datasets → new shared project in ml-pipelines
Update dependent projects to import the published wheel
Minimal DAG updates in airflow-dags for pilot if needed

Out of Scope

Migrating specific pipelines (revert-risk, add-a-link, inference, etc.)
Mandating Command API across all pipelines

Acceptance

Shared project created; CI green; wheel published; brief docs
At least one ML project consumes the package
Command API used in one pilot

Details

Due Date: Dec 30 2025, 12:00 AM

Related Objects

Mentioned In: T382072: Offline pipelines
T375291: Research Infrastructure component accountability
T398249: [Q1 FY 25-26 Applied Sciences Team] Building the Foundations Research
T398950: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines

Event Timeline

fkaelin created this task. Jul 8 2025, 2:21 PM

fkaelin updated the task description. Jul 8 2025, 2:29 PM

fkaelin added a project: Research-engineering. Jul 9 2025, 2:44 PM

fkaelin added a subscriber: Miriam.

OKarakaya-WMF subscribed. Jul 16 2025, 10:19 AM

Miriam assigned this task to fkaelin. Jul 17 2025, 8:38 AM

Thanks @fkaelin for creating this task. We discussed this with @isarantopoulos and there are a few things we should figure out before we start this migration:

Ownership of the migrated code
What pipelines from Research does ML need? Do we need to migrate everything?
What are the dependencies?
How do we work together on this migration (roles and responsibilities, workflows, etc.)?

@isarantopoulos please let me know if I am missing something. Happy to create a separate task for the above.

Miriam moved this task from Backlog to Staged on the Research board. Jul 17 2025, 8:42 AM

OKarakaya-WMF mentioned this in T398950: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines. Jul 17 2025, 9:08 AM

isarantopoulos added a project: Machine-Learning-Team. Jul 22 2025, 12:35 PM

kevinbazira subscribed. Jul 22 2025, 2:34 PM

Miriam mentioned this in T398249: [Q1 FY 25-26 Applied Sciences Team] Building the Foundations Research. Jul 22 2025, 3:02 PM

diego subscribed. Jul 23 2025, 5:07 PM

Miriam set Due Date to Mon, Sep 29, 11:00 PM. Jul 28 2025, 9:53 AM

Miriam moved this task from Staged to In Progress on the Research board.

This should be interesting for this task:
https://gitlab.wikimedia.org/repos/research/research-common

Ottomata subscribed. Jul 30 2025, 3:51 PM

Miriam mentioned this in T375291: Research Infrastructure component accountability. Jul 31 2025, 10:38 AM

Miriam mentioned this in T382072: Offline pipelines.

Miriam triaged this task as Medium priority. Aug 26 2025, 11:00 AM

Miriam added a subscriber: Sucheta-Salgaonkar-WMF. Sep 17 2025, 2:54 PM

isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board. Sep 18 2025, 9:32 AM

Miriam moved this task from In Progress to FY2025-26-Research-October-December on the Research board. Thu, Oct 16, 12:52 PM

Miriam edited projects, added Research (FY2025-26-Research-October-December); removed Research.

Miriam changed Due Date from Mon, Sep 29, 11:00 PM to Dec 30 2025, 12:00 AM. Thu, Oct 16, 12:58 PM

fkaelin updated the task description. Thu, Oct 16, 2:54 PM

Weekly updates

The suggested contributions are described in this doc
Discussion is ongoing regarding ML dev tooling (with notebook support). The prototype repo is wmfing, which replaces the archived research-common repo.

Very cool!

The 'Command API' sounds like a very useful thing outside of just ML and Research jobs. As you build it, perhaps it would make sense to consult with Data Engineering @amastilovic @mforns @xcollazo etc. to see if there might be a way to build it for general use?
