Add Wikipedia as data source #167
This is a great start!
I expect that more meaningful data is possible (in addition to default license and total articles):
- How do licenses differ across articles? Does API expose multiple licenses or just primary/default?
- How do licenses and counts compare across the different instances of Wikipedia (different languages)?
- etc.
@TimidRobot I was trying to get the count of articles by category, but it seems a bit tricky because the category structure is hierarchical. I would have to recurse into each subcategory. Do you think I should give it a try?
Please give an example of the categories.
@TimidRobot these categories: https://en.wikipedia.org/wiki/Category:Main_topic_classifications
I don't think the categories provide meaningful information on how the CC Legal Tools are being used, so they can be skipped.
@TimidRobot Oh, okay. Since Wikipedia uses only one tool, do you have any ideas on how I could analyse the license? Maybe I can think in that direction.
Count by language is good. I don't currently have any ideas about further analysis. Please resolve outstanding comments and then I'll re-review.
@TimidRobot I have made the changes. I also followed the instructions for running scripts. I don't quite understand what you meant when you said I should make the script executable.
Please see:
If a file is not executable, you have to specify the interpreter yourself, for example:
    pipenv run python ./scripts/1-fetch/github_fetch.py -h
An executable file can be run directly, for example:
    pipenv run ./scripts/1-fetch/github_fetch.py -h
The first line (called the shebang) tells the shell what to use to execute it (the interpreter):
    #!/usr/bin/env python
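To illustrate the two conditions described above (an executable bit plus a shebang line), here is a minimal Python sketch. The helper names `make_executable` and `has_shebang` are hypothetical and not part of the project; in practice the same effect is usually achieved with `chmod +x` on the script.

```python
import os
import tempfile


def make_executable(path):
    """Add execute permission wherever read permission is already set."""
    mode = os.stat(path).st_mode
    # Copy each read bit (r--) to the matching execute bit (--x).
    mode |= (mode & 0o444) >> 2
    os.chmod(path, mode)


def has_shebang(path):
    """Return True if the first line of the file starts with '#!'."""
    with open(path, "rb") as f:
        return f.readline().startswith(b"#!")


# Demonstration on a throwaway file (not a real project script).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("#!/usr/bin/env python\nprint('hello')\n")
    demo_path = tmp.name

make_executable(demo_path)
demo_is_executable = os.access(demo_path, os.X_OK)
demo_has_shebang = has_shebang(demo_path)
os.remove(demo_path)
```

Once both conditions hold, the shell can run the file directly and uses the shebang to pick the interpreter.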
@TimidRobot Thank you for the explanation! I think I am able to do it now. I just made a push, please check if I did it right.
Keep up the good work, nearly there
Looking good!
Including the names of the languages in English will make reporting clearer.
@TimidRobot I have made the changes.
At this point it does everything it needs to. Now we're just making it easier to use.
@TimidRobot I have updated the logic for when language_name_en is None or empty and I tested it.
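The thread does not show how the `None` or empty case is handled, so the following is only a sketch of one plausible approach. It assumes a fallback to the language code; the function name `display_language_name` and that fallback choice are hypothetical.

```python
def display_language_name(language_code, language_name_en):
    """Return a label for reporting, preferring the English language name.

    Assumption: when the English name is None, empty, or whitespace,
    fall back to the language code so the row is still identifiable.
    """
    if language_name_en is None or not language_name_en.strip():
        return language_code
    return language_name_en.strip()


labels = [
    display_language_name("fr", "French"),
    display_language_name("xx", None),
    display_language_name("yy", "   "),
]
```

Handling `None` and empty strings in a single check keeps the reporting code free of special cases.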
Fantastic work! Thank you 🙏🏻
@TimidRobot Thank you for your input as well!
Fixes
Description
Added wikipedia_fetch.py to implement Wikipedia as a data source. Currently it counts the number of articles across the different language instances of Wikipedia. Wikipedia mainly uses the Creative Commons Attribution-Share Alike 4.0 license as its primary license. We can fetch the following data:
Technical details
- github_fetch.py served as the basis for wikipedia_fetch.py. Similar functions are used, and I leveraged the code sample in the Wikipedia API documentation to structure the parameters for querying Wikipedia.
- rightsinfo is used to get information about the licenses, and statistics is used for the count of articles associated with the licenses.
Checklist
Update index.md).mainormaster).visible errors.
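As a rough illustration of the query described under Technical details, the sketch below builds MediaWiki `siteinfo` parameters requesting `rightsinfo` and `statistics`, then pulls the license and article count out of a response. The helper names and the sample response values are illustrative assumptions, not the code in wikipedia_fetch.py, and no network request is made here.

```python
def api_url(language_code):
    """Build the API endpoint for one language instance of Wikipedia."""
    return f"https://{language_code}.wikipedia.org/w/api.php"


def build_siteinfo_params():
    """Parameters for a siteinfo query returning license info and statistics."""
    return {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics|rightsinfo",
        "format": "json",
    }


def extract_license_and_count(response):
    """Pick the license fields and article count out of a parsed JSON response."""
    query = response["query"]
    return {
        "license_url": query["rightsinfo"]["url"],
        "license_text": query["rightsinfo"]["text"],
        "article_count": query["statistics"]["articles"],
    }


# Made-up sample response for demonstration only (the count is not real data).
sample_response = {
    "query": {
        "statistics": {"articles": 123},
        "rightsinfo": {
            "url": "https://creativecommons.org/licenses/by-sa/4.0/",
            "text": "Creative Commons Attribution-Share Alike 4.0",
        },
    }
}

row = extract_license_and_count(sample_response)
```

With an HTTP client, `api_url("en")` and `build_siteinfo_params()` would be passed together to a GET request, once per language code.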
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."