
[Research Engineering Request] Produce regular snapshots of all Wikipedia article topics
Closed, Resolved · Public

Description

Goal

Produce a regular (likely monthly) snapshot of all Wikipedia articles and their predicted topics from the language-agnostic article topic model. The snapshot should be available via HDFS (e.g., a Hive table), though it would also be nice to release the dump publicly.

Why

For large-scale analyses of content / editing trends, aggregating data by topic is a useful approach to understanding the underlying dynamics. We have a topic classification model available on LiftWing that can do this for any article on Wikipedia, but APIs aren't a great fit for handling millions of requests (as would easily be the case here). Making topic predictions for every article on Wikipedia is relatively simple, however, with access to the classification model and the link data present on the HDFS cluster.

For example: T290042 and T351114

Engineering required

Likely an Airflow job that collects the input data (article links) for all articles and runs them through the model before saving the results to the appropriate table or dump. The proposed monthly cadence is because that is how frequently the pagelinks/redirects tables are uploaded to Hive, and they are core components of the pipeline for producing the input features for the model (Wikidata snapshots are also required, but they happen at more frequent intervals). Some details are provided in T290042#7326209, but if we assume that the model is not going to be retrained as part of this process, then the relevant links are:

  • Generate links data for each article: https://github.com/geohci/wikipedia-language-agnostic-topic-classification/blob/master/outlinks/01a_build_outlinks_data_cluster.ipynb
    • Note: this is older so it probably should be updated slightly, e.g., using the canonical_data.wikis table to narrow down to just Wikipedia articles. The pagelinks table may also have changed in format.
  • Bulk predict assuming a TSV of all the article links and trained model: https://github.com/geohci/wikipedia-language-agnostic-topic-classification/blob/master/utils/bulk_predict.py
    • Note: I can just provide the current model binary but an ideal process probably has a way to download it directly from where LiftWing stores the model binaries so it's clearly linked to a model version there.
    • Note: this builds a dense output -- i.e. the scores for every article and topic -- but the actual predictions are much sparser, and in practice you can probably just produce a table with only the article+topic pairs that exceed a threshold of 0.15 (see the sketch just after this list). We recommend folks use the threshold of 0.5 for assigning topics, but the lower threshold of 0.15 gives some leeway for folks to adjust this if they'd like while still removing the vast majority of irrelevant article+topic pairs.
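
For illustration, a minimal PySpark sketch of that thresholding step (the table and column names here are hypothetical, not the actual pipeline's):

# Minimal sketch: sparsify dense model output by keeping only article+topic
# pairs whose score exceeds a low threshold. Input/output names are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

THRESHOLD = 0.15  # low cutoff; 0.5 is the recommended threshold for assigning topics

# dense: one row per article, with an array of (topic, score) structs
dense = spark.table("tmp_dense_topic_scores")  # hypothetical input table

sparse = (
    dense
    .select("wiki_db", "pid", F.explode("scores").alias("s"))
    .select("wiki_db", "pid",
            F.col("s.topic").alias("topic"),
            F.col("s.score").alias("score"))
    .filter(F.col("score") >= THRESHOLD)
)

sparse.write.mode("overwrite").saveAsTable("tmp_sparse_article_topics")  # hypothetical output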

In theory, you might also be able to use the LiftWing eventstream for the model to update a snapshot, as the Search platform does. In practice, however, this would also require tracking article deletions, moves, etc., so producing a monthly snapshot from scratch is probably the simplest path.
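
To make the scope concrete, here is a hypothetical sketch of what the monthly job could look like (the task names, file paths, and operator config are all assumptions, not the deployed DAG):

# Hypothetical sketch of the monthly job; not the actual airflow-dags code.
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="article_topics_monthly",
    schedule="@monthly",  # matches the cadence of the pagelinks/redirects Hive imports
    start_date=datetime(2024, 5, 1),
    catchup=False,
) as dag:
    build_outlinks = SparkSubmitOperator(
        task_id="build_outlinks",
        application="jobs/build_outlinks.py",  # gather link features per article (hypothetical path)
    )
    bulk_predict = SparkSubmitOperator(
        task_id="bulk_predict",
        application="jobs/bulk_predict.py",  # run the topic model over all articles (hypothetical path)
    )
    build_outlinks >> bulk_predict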

Details

Related Changes in GitLab:
Title: research:article-topics: update model url and create table if not exists
Reference: repos/data-engineering/airflow-dags!749
Author: mnz
Source Branch: mnz/article-topics-dag
Dest Branch: main

Event Timeline

Isaac renamed this task from "Produce regular snapshots of all Wikipedia article topics" to "[Research Engineering Request] Produce regular snapshots of all Wikipedia article topics". Nov 20 2023, 6:44 PM

We reviewed this task in the backlog grooming meeting on November 21st. Given the limited capacity on the engineering front at this time and prioritization discussions (with input from @fkaelin and @Miriam), we decided to prioritize T351674 instead. We will keep this task open as it is possible that we can pick it up in the coming 6 months. We will review it again in future backlog grooming meetings.

fkaelin moved this task from Backlog to In Progress on the Research board.
  • This pipeline is implemented (MR)
  • Remaining work: schedule an Airflow DAG to regularly compute the new topics dataset

I'm working with the Inuka team to establish baselines for their KR related to key topic areas. Could you provide an estimated timeline for when this article topics dataset will be available on HDFS?

Hi @cchen, the May 2024 snapshot for the article topics dataset is available at hdfs:///tmp/research/article_topics/20240501_20240601. There's also an Airflow DAG for this pipeline which will be deployed shortly, allowing us to produce regular snapshots starting next month. Please let me know if you have any questions!
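
For anyone who wants a quick look, a minimal PySpark sketch for inspecting the snapshot (only the path above is from this task; the checks are illustrative):

# Quick sanity check of the May 2024 snapshot.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

topics = spark.read.parquet("hdfs:///tmp/research/article_topics/20240501_20240601")
topics.printSchema()        # inspect the columns
print(topics.count())       # rough sanity check against expected article counts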

@MunizaA Thanks for updating the topic dataset!

@Isaac can we close this task? Anything you see that's not completed yet?

Generally, I checked the parquet files mentioned above and they looked great, largely matching up with descriptive stats from past topic datasets! Two clarifying questions for @MunizaA before we close out:

  • I presume we're keeping the most recent snapshot and not storing prior runs? If so, that makes sense to me. I could see justification for storing maybe the previous snapshot too (just to be able to easily detect changes if desired) but I see no reason for storing the topics from older runs.
  • Sorry I didn't spot this earlier but can we align with the model currently being used by LiftWing (assuming this is the model used by the DAG)?

I wanted to chime in that this data will be very useful for us as we look at high level metrics in the upcoming FY. Thank you Research team for prioritizing and completing this!
I wanted to add that having this data in a Hive table would be very helpful in making it user friendly. It would be very convenient if we could run SQL queries directly on it and use it to make charts in Superset, to analyze the topics that engage readers and editors.

Thanks @Isaac and @Mayakp.wiki for your feedback!

> I presume we're keeping the most recent snapshot and not storing prior runs? If so, that makes sense to me. I could see justification for storing maybe the previous snapshot too (just to be able to easily detect changes if desired) but I see no reason for storing the topics from older runs.

We're keeping snapshots from the last 4 months (just to be safe since this is a new pipeline). Once a snapshot gets older than that, it gets deleted when the DAG runs.

> Sorry I didn't spot this earlier but can we align with the model currently being used by LiftWing (assuming this is the model used by the DAG)?

Good catch! I've updated the version to match the one being used by LiftWing.
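
As an aside, the retention rule described above might look something like this (a hypothetical illustration, not the actual DAG code; only the base path comes from the earlier comment):

# Hypothetical illustration of the retention rule: keep the last 4 monthly
# snapshots, delete anything older. Not the actual DAG code.
import subprocess

BASE = "hdfs:///tmp/research/article_topics"  # path from the earlier comment
KEEP = 4  # number of monthly snapshots to retain

# list snapshot directories; date-based names sort oldest-first
out = subprocess.run(["hdfs", "dfs", "-ls", "-C", BASE],
                     capture_output=True, text=True, check=True)
snapshots = sorted(out.stdout.split())

for path in snapshots[:-KEEP]:
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)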

> I wanted to add that having this data in a Hive table would be very helpful in making it user friendly. It would be very convenient if we could run SQL queries directly on it and use it to make charts in Superset, to analyze the topics that engage readers and editors.

A Hive table for this data has now been created:

spark-sql (default)> DESCRIBE TABLE research.article_topics;
col_name          data_type                                        comment
pid_from          bigint                                           id of the page
qid_from          string                                           wikidata QID of the page
outlinks          string                                           all target links from this page, separated by space
embedding         array<float>                                     embedding generated by the outlink topic model
predicted_labels  array<struct<label:string,probability:float>>   predicted topic labels
snapshot          string                                           mediawiki snapshot
wiki_db           string                                           the wiki project
# Partition Information
# col_name        data_type                                        comment
snapshot          string                                           mediawiki snapshot
wiki_db           string                                           the wiki project
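
As an example of the kind of query this enables (run here via spark.sql; the snapshot value is illustrative), here's a count of enwiki articles per topic at the recommended 0.5 threshold:

# Count English Wikipedia articles per topic, applying the recommended
# 0.5 threshold. The snapshot value below is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    SELECT t.label AS topic,
           COUNT(DISTINCT pid_from) AS num_articles
    FROM research.article_topics
    LATERAL VIEW explode(predicted_labels) exploded AS t
    WHERE snapshot = '2024-05'          -- illustrative snapshot value
      AND wiki_db = 'enwiki'
      AND t.probability >= 0.5          -- recommended threshold for assigning topics
    GROUP BY t.label
    ORDER BY num_articles DESC
""").show(20, truncate=False)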

If you have any questions or comments, let me know!

This all looks and sounds great -- thanks @MunizaA, and good to resolve from my end!