
Automating WikiCommons Data Source #180

@najuna-brian

Description


Problem

Right now, the project collects data from Google Custom Search and GitHub, and work on adding Wikipedia is already in progress via PRs #176 and #167. @TimidRobot also commented (regarding Wikipedia) that they expect “more meaningful data” than basic counts. WikiCommons, however, has not been addressed yet.

WikiCommons is an important source of Creative Commons–licensed media, yet it is not part of the automated system.
An older implementation exists under pre-automation/wikicommons/, but it has not been updated to the new structure.

Description

Add WikiCommons as a new data source using the MediaWiki API.
This would collect counts of CC-licensed media files (images, videos, and audio) by license type.
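As a rough sketch of what this could look like: the MediaWiki API exposes a `categoryinfo` property that returns member counts (including file counts) for a category page, which could be queried per license category on Commons. The endpoint URL, category naming, and function names below are assumptions for illustration, not the project's actual helpers.

```python
"""Sketch: count CC-licensed media files on Wikimedia Commons per license,
using the MediaWiki API's action=query&prop=categoryinfo. Category names
(e.g. "CC-BY-4.0") are assumed; the real script would enumerate them."""
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons' MediaWiki API.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_categoryinfo_url(category: str) -> str:
    """Build a query URL returning member counts for one license category."""
    params = {
        "action": "query",
        "prop": "categoryinfo",
        "titles": f"Category:{category}",
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def extract_file_count(response: dict) -> int:
    """Pull the media-file count out of a categoryinfo query response.

    Returns 0 when the category is missing or has no categoryinfo block.
    """
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))  # single title queried, single page back
    return page.get("categoryinfo", {}).get("files", 0)
```

Querying one URL per license category keeps each request trivially cacheable for the fetch step, at the cost of one round trip per license.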

The plan is to:

  • Review the old pre-automation/wikicommons_scratcher.py script.
  • Rewrite it to match the new 3-step workflow (1-fetch, 2-process, 3-report).
  • Make sure the new script uses the current shared helpers and output format.
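The 3-step split above might be organized along these lines; file paths, the CSV columns, and the helper names here are placeholders, since the project's shared helpers and output format would be confirmed against the existing data sources during implementation.

```python
"""Sketch of a 1-fetch / 2-process / 3-report split for WikiCommons counts.
Paths and column names are hypothetical placeholders."""
import csv
import json
from pathlib import Path

RAW_PATH = Path("data/wikicommons_fetch.json")      # hypothetical output of step 1
REPORT_PATH = Path("data/wikicommons_report.csv")   # hypothetical output of step 3


def fetch(raw_counts: dict) -> None:
    """Step 1: persist raw per-license counts exactly as gathered from the API."""
    RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
    RAW_PATH.write_text(json.dumps(raw_counts, indent=2))


def process(raw: dict) -> list:
    """Step 2: drop empty licenses and sort by file count, descending."""
    return sorted(
        ((license, count) for license, count in raw.items() if count > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )


def report(rows: list) -> None:
    """Step 3: write the processed counts as a CSV report."""
    with REPORT_PATH.open("w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["LICENSE", "FILE_COUNT"])
        writer.writerows(rows)
```

Keeping the raw API responses on disk between steps means the process and report stages can be re-run (or debugged) without re-querying the API.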

This will help the project measure CC-licensed media content more accurately.

Alternatives

It could be combined with the Wikipedia data, but keeping it separate makes it easier to track media content specifically.

Additional context

Implementation

  • I would be interested in implementing this feature.

Metadata

Status: Backlog