@oree-xx
Contributor

@oree-xx oree-xx commented Oct 10, 2025

Fixes

Description

Added wikipedia_fetch.py to implement Wikipedia as a data source. Currently, it counts the number of articles across the different language instances of Wikipedia.
Wikipedia mainly uses the Creative Commons Attribution-ShareAlike 4.0 license as its primary license. We can fetch the following data:

  • Count of articles by language: this shows the usage of CC BY-SA 4.0 across the different instances of Wikipedia.
  • Count of articles by category (English Wikipedia): a breakdown of the use of CC BY-SA 4.0 by category.
  • Count of page views by category: we can get the most viewed sets of articles by category.

Technical details

  • I used the structure of github_fetch.py to implement wikipedia_fetch.py. Similar functions are used, and I leveraged the code sample in the Wikipedia API documentation to structure the parameters for querying Wikipedia.
  • The parameters include rightsinfo, to get information about the licenses used, and statistics, for the count of articles associated with the licenses.
  • To get the count of articles across all the Wikipedia languages, I used the sitematrix parameter.
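
As a rough illustration of the queries described above, here is a minimal sketch. It is not the code in wikipedia_fetch.py: the helper names are hypothetical, and it assumes the usual MediaWiki API shapes, where rightsinfo and statistics are requested through meta=siteinfo and sitematrix is requested as an API action. The parsing helpers work on already-decoded JSON, so no network access is needed to try them.

```python
# Illustrative sketch only; not the code from wikipedia_fetch.py.
# All helper names here are hypothetical.


def siteinfo_params():
    """Query parameters asking one wiki for its license and statistics."""
    return {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "rightsinfo|statistics",
        "format": "json",
    }


def sitematrix_params():
    """Query parameters asking for the list of all language instances."""
    return {"action": "sitematrix", "format": "json"}


def parse_siteinfo(response):
    """Return (license_url, article_count) from a decoded siteinfo response."""
    query = response.get("query", {})
    license_url = query.get("rightsinfo", {}).get("url", "")
    article_count = query.get("statistics", {}).get("articles", 0)
    return license_url, article_count


def wikipedia_sites(response):
    """Yield (language_code, site_url) for each open Wikipedia instance."""
    for key, group in response.get("sitematrix", {}).items():
        # Numeric keys hold language groups; "count" and "specials" do not.
        if not key.isdigit():
            continue
        for site in group.get("site", []):
            # Assumption: Wikipedia sites carry the code "wiki", and closed
            # wikis are flagged by the presence of a "closed" key.
            if site.get("code") == "wiki" and "closed" not in site:
                yield group.get("code", ""), site.get("url", "")
```

Each URL yielded by wikipedia_sites could then be queried with siteinfo_params() to collect a license URL and an article count per language.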

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Member

@TimidRobot TimidRobot left a comment

This is a great start!

I expect that more meaningful data is possible (in addition to default license and total articles):

  • How do licenses differ across articles? Does the API expose multiple licenses or just the primary/default one?
  • How do licenses and counts compare across the different instances of Wikipedia (different languages)?
  • etc.

@oree-xx oree-xx marked this pull request as ready for review October 12, 2025 12:06
@oree-xx oree-xx requested review from a team as code owners October 12, 2025 12:06
@oree-xx oree-xx requested review from TimidRobot and possumbilities and removed request for a team October 12, 2025 12:07
@oree-xx
Contributor Author

oree-xx commented Oct 14, 2025

@TimidRobot I was trying to get the count of articles by category, but it seems a bit tricky because the category structure is hierarchical: I would have to loop through each subcategory recursively. Do you think I should give it a try?

@TimidRobot TimidRobot self-assigned this Oct 14, 2025
@TimidRobot
Member

@TimidRobot I was trying to get the count of articles by category, but it seems a bit tricky because the category structure is hierarchical: I would have to loop through each subcategory recursively. Do you think I should give it a try?

Please give an example of the categories

@oree-xx
Contributor Author

oree-xx commented Oct 14, 2025

@TimidRobot these categories: https://en.wikipedia.org/wiki/Category:Main_topic_classifications.
But I have done the count of articles by language in my PR.

@oree-xx oree-xx changed the title [WIP] Added wikipedia as data source Added wikipedia as data source Oct 14, 2025
@TimidRobot
Member

@TimidRobot these categories: https://en.wikipedia.org/wiki/Category:Main_topic_classifications. But I have done the count of articles by language in my PR.

I don't think the categories provide meaningful information on how the CC Legal Tools are being used and can be skipped.

@TimidRobot TimidRobot changed the title Added wikipedia as data source Add wikipedia as data source Oct 15, 2025
@TimidRobot TimidRobot changed the title Add wikipedia as data source Add Wikipedia as data source Oct 15, 2025
@oree-xx
Contributor Author

oree-xx commented Oct 15, 2025

@TimidRobot Oh, okay. Since Wikipedia uses only one tool, do you have any idea how I could analyse the license? Maybe I can think in that direction.
Also, any comments on the count of articles by language?

@TimidRobot
Member

@TimidRobot Oh, okay. Since Wikipedia uses only one tool, do you have any idea how I could analyse the license? Maybe I can think in that direction. Also, any comments on the count of articles by language?

Count by language is good.

I don't currently have any ideas about further analysis. Please resolve outstanding comments and then I'll re-review.

@oree-xx
Contributor Author

oree-xx commented Oct 15, 2025

@TimidRobot I have made the changes. I also followed the instruction for running scripts. I don't quite understand when you said I should make the script executable.

@TimidRobot
Member

TimidRobot commented Oct 15, 2025

@TimidRobot I have made the changes. I also followed the instruction for running scripts. I don't quite understand when you said I should make the script executable.

@oree-xx

Please see:

If a file is not executable, you have to specify the interpreter yourself, for example:

pipenv run python ./scripts/1-fetch/github_fetch.py -h

An executable file can be run directly, for example:

pipenv run ./scripts/1-fetch/github_fetch.py -h

The first line (called the shebang) tells the shell what to use to execute it (the interpreter):

#!/usr/bin/env python
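
The usual way to make a script executable is chmod +x in a shell. As a small sketch of what that involves, the same check and change can be expressed in Python with the standard library. The helper names here are hypothetical and not part of this repository, and the permission handling assumes a POSIX system.

```python
import os
import stat


def has_shebang(path):
    """Return True if the file starts with the two bytes '#!'."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


def make_executable(path):
    """Add execute permission wherever read permission is already set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Shift each read bit (r--) into the matching execute position (--x).
    read_bits = mode & (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    os.chmod(path, mode | (read_bits >> 2))
```

A file needs both pieces to be run directly: the execute permission, and a shebang line so the shell knows which interpreter to use.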

@oree-xx
Contributor Author

oree-xx commented Oct 15, 2025

@TimidRobot Thank you for the explanation! I think I am able to do it now. I just made a push, please check if I did it right.

Member

@TimidRobot TimidRobot left a comment

Keep up the good work, nearly there

@oree-xx

This comment was marked as outdated.

Member

@TimidRobot TimidRobot left a comment

Looking good!

Including the names of the languages in English will make reporting clearer.

@oree-xx
Contributor Author

oree-xx commented Oct 20, 2025

@TimidRobot I have made the changes.

Member

@TimidRobot TimidRobot left a comment

At this point it does everything it needs to. Now we're just making it easier to use.

@oree-xx
Contributor Author

oree-xx commented Oct 20, 2025

@TimidRobot I have updated the logic for when language_name_en is None or empty and I tested it.
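
A minimal sketch of that kind of fallback is below. The actual logic in the PR is not shown in this conversation, so the function name and the fallback text are hypothetical; only language_name_en comes from the comment above.

```python
def display_language_name(language_name_en, language_code):
    """Return the English language name, or a fallback built from the code."""
    # Treat None, "", and whitespace-only strings the same way.
    name = (language_name_en or "").strip()
    if name:
        return name
    # Hypothetical fallback text; the PR's real fallback value may differ.
    return f"Unknown ({language_code})"
```

Falling back to something derived from the language code keeps every row identifiable in reports even when the English name is missing.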

Member

@TimidRobot TimidRobot left a comment

Fantastic work! Thank you 🙏🏻

@TimidRobot TimidRobot merged commit b023018 into creativecommons:main Oct 21, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in TimidRobot Oct 21, 2025
@oree-xx
Contributor Author

oree-xx commented Oct 21, 2025

@TimidRobot Thank you for the inputs also!

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Add Wikipedia as a new data source

2 participants