Problem
Hello. Right now, the project collects data from Google Custom Search and GitHub, and work on adding Wikipedia is already in progress via PRs #176 and #167. Also, @TimidRobot commented about “more meaningful data” (for Wikipedia), suggesting they expect more than just basic counts, but WikiCommons hasn’t been addressed yet.
However, WikiCommons is also an important source for Creative Commons–licensed media, and it’s not yet part of the automated system.
There’s an older version of it under pre-automation/wikicommons/, but it hasn’t been updated to the new structure.
Description
WikiCommons could be added as a new data source using the MediaWiki API.
This would collect counts of CC-licensed media files (like images, videos, and audio) by license type.
The plan is to:
- Review the old pre-automation/wikicommons_scratcher.py script.
- Rewrite it to match the new 3-step workflow (1-fetch, 2-process, 3-report).
- Make sure the new script uses the current shared helpers and output format.
This will help the project measure CC-licensed media content more accurately.
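As a rough illustration of how the 1-fetch, 2-process, 3-report split could look for this source, here is a minimal in-memory sketch. The function names, the CSV column names, and the sample rows are all hypothetical; the project's real shared helpers, file layout, and output format are not shown here and would replace them.

```python
import csv
import io
from collections import defaultdict


def fetch(rows):
    """Step 1 (hypothetical): write raw (license, media type, count) rows as CSV text.

    In a real script the rows would come from the API rather than an argument.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["LICENSE", "MEDIA_TYPE", "COUNT"])  # assumed column names
    writer.writerows(rows)
    return buffer.getvalue()


def process(raw_csv):
    """Step 2 (hypothetical): total the counts per license across media types."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        totals[row["LICENSE"]] += int(row["COUNT"])
    return dict(totals)


def report(totals):
    """Step 3 (hypothetical): render totals as plain-text lines, largest first."""
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{name}: {count}" for name, count in ordered)
```

Keeping each step a separate function that only consumes the previous step's output mirrors the idea of three independent scripts, so any one stage could be rerun without repeating the others.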
Alternatives
It could be combined with the Wikipedia data, but keeping it separate makes it easier to track media content specifically.
Additional context
- Old script: pre-automation/wikicommons/
- API: MediaWiki Action API (https://commons.wikimedia.org/w/api.php)
- Builds on similar work done for Wikipedia (Add Wikipedia as a new data source #159, Add Wikipedia as data source #167, [DISCARDED] Add Wikipedia as a new data source (wikipedia_fetch.py) #176)
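One possible way to get per-license file counts from the Action API endpoint listed above is the `categoryinfo` property of `action=query`, which reports how many files sit directly in a category. The sketch below is only an assumption about how the fetch step might work: the category names, the User-Agent string, and the function names are illustrative and would need to be checked against Commons and the project's conventions.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"

# Assumed license category titles; verify the exact names on Commons.
LICENSE_CATEGORIES = ["Category:CC-BY-4.0", "Category:CC-BY-SA-4.0"]


def build_params(categories):
    """Build query parameters asking for categoryinfo on each category."""
    return {
        "action": "query",
        "prop": "categoryinfo",
        "titles": "|".join(categories),
        "format": "json",
        "formatversion": "2",
    }


def parse_file_counts(response):
    """Map category title to its direct file count; skip pages without categoryinfo."""
    counts = {}
    for page in response.get("query", {}).get("pages", []):
        info = page.get("categoryinfo")
        if info is not None:
            counts[page["title"]] = info.get("files", 0)
    return counts


def fetch_file_counts(categories):
    """Call the API once and return the parsed file counts (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_params(categories))
    request = urllib.request.Request(
        url,
        # Placeholder User-Agent; a real script should identify itself properly.
        headers={"User-Agent": "wikicommons-fetch-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return parse_file_counts(json.load(resp))
```

Calling `fetch_file_counts(LICENSE_CATEGORIES)` would return a dictionary keyed by category title. Note that `categoryinfo` counts only direct members of a category, so files in subcategories would need a separate traversal if they should be included.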
Implementation
- I would be interested in implementing this feature.