In T401021#11286092, @achou wrote:

Summary for yesterday's meeting (doc)

Data Model:

Use “MVCC” revision_id in composite key

Update Pipeline:

Streaming updates in LiftWing+ChangeProp

With intentions / commitment for DPE to examine and hopefully build platform support for this in the near future.

@Eevans For next steps, we would like to have the instance ready so we can begin working on bootstrap/initial ingestion to Cassandra. Do you have an estimated timeline for when this can be ready? Is there anything else you need from us? :)

Wed, Oct 22, 12:10 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Tue, Oct 21

Eevans updated the task description for T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

Tue, Oct 21, 9:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans updated the task description for T401260: Global Editor Metrics - Data Persistence Design Review.

Tue, Oct 21, 4:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Data-Persistence

Mon, Oct 20

Eevans added a comment to T407414: aqs1012 is down.

In T407414#11289616, @Eevans wrote:

In T407414#11285096, @Jclark-ctr wrote:

@Eevans are you able to reimage the server? I have had no luck due to a "no root partition" error. And the preseed file has -efi for RAID configuration, on a server set up for legacy BIOS?

I haven't tried (and wouldn't trust putting it back into production without first understanding the failure that got us here). That preseed is supposed to work with legacy BIOS (even though it supports UEFI). It's the same preseed that was used the last time that host was installed (back in...July-ish, I think?), along with all of the sessionstore hosts (also legacy BIOS).

TL;DR I think there is something else wrong here.

Mon, Oct 20, 2:07 PM · SRE, DC-Ops, ops-eqiad

Eevans added a comment to T407414: aqs1012 is down.

In T407414#11285096, @Jclark-ctr wrote:

@Eevans are you able to reimage the server? I have had no luck due to a "no root partition" error. And the preseed file has -efi for RAID configuration, on a server set up for legacy BIOS?

Mon, Oct 20, 2:05 PM · SRE, DC-Ops, ops-eqiad

Wed, Oct 15

Eevans triaged T407414: aqs1012 is down as High priority.

Wed, Oct 15, 5:51 PM · SRE, DC-Ops, ops-eqiad

Eevans created T407414: aqs1012 is down.

Wed, Oct 15, 5:51 PM · SRE, DC-Ops, ops-eqiad

Eevans added a comment to T405942: eqiad row C/D Data Persistence host migrations.

In T405942#11273802, @RobH wrote:

[ ... ]

In T405942#11268506, @Eevans wrote:

Provided that the moves happen one at a time (probably goes without saying), then the Cassandra hosts can be done at any time, and without coordination. The Cassandra hosts here are: aqs*, restbase*, & sessionstore*

aqs*, restbase*, & sessionstore can be done anytime without coordination. @Eevans: So no icinga notice and just move the network port without further interactions with the OS or services? If so, that is by far the easiest. Do they require any time between hosts? That is 6 aqs hosts, 5 restbase, and 2 sessionstore, so less than 6 business days.

So if we need to do anything other than move the port, please let us know.

Wed, Oct 15, 5:37 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad

Mon, Oct 13

Eevans added a comment to T405942: eqiad row C/D Data Persistence host migrations.

Provided that the moves happen one at a time (probably goes without saying), then the Cassandra hosts can be done at any time, and without coordination. The Cassandra hosts here are: aqs*, restbase*, & sessionstore*

Mon, Oct 13, 1:23 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad

Fri, Oct 10

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11264016, @isarantopoulos wrote:

[...]

If possible, I would like to decouple the schema decision discussed in the task from the update/ingestion mechanism and its architecture, and my understanding is that your latest recommendation allows us to achieve this.

Fri, Oct 10, 2:28 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Thu, Oct 9

Eevans updated subscribers of T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11258859, @isarantopoulos wrote:

@Eevans Aiko has suggested a way to query for page_id,revision_id & model_version in T401021#11190742

PRIMARY KEY((wiki, page_id, revision_id), model_version)

Thu, Oct 9, 3:00 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Tue, Oct 7

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11197656, @achou wrote:

[ ... ]

@Eevans We would like to use the following schema. What do you think?

CREATE TABLE table (
  wiki_id    text, -- enwiki, frwiki, etc
  page_id    int,
  revision_id    int,
  paragraphs    map<text, float>, -- plaintext paragraph with tone issues and score. can be null if no paragraphs have tone issues
  model_version    text,
  PRIMARY KEY((wiki_id, page_id, revision_id), model_version)
)

Tue, Oct 7, 1:12 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11246721, @isarantopoulos wrote:

Update: Growth team won't be doing the testwiki PoC this quarter, so we don't have an urgent timeline to ingest a one-off dataset to staging Cassandra

I'm circling back on this to figure out if we can align on the timelines. We would like to have the instance by mid-October (the 15th) so we can work on ingesting data that would enable an A/B test. Is this possible?

Tue, Oct 7, 12:19 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Mon, Oct 6

Eevans added a comment to T402850: Decide on anonymous session backend.

In T402850#11201328, @Tgr wrote:

@Eevans we are now very close to wrapping up the coding part of T400372: Separate storage backend for anonymous sessions. That will allow for separate Cassandra namespaces for anonymous and authenticated sessions (also e.g. per-wiki as proposed in T392170: sessionstorage namespacing if that's deemed useful), but also something more aggressive like using Cassandra for authenticated sessions but Memcached for anonymous sessions. (Using Memcached was proposed in T362335: Simplify MediaWiki session store at WMF but rejected because routine Memcached maintenance would then result in users getting logged out. With anonymous users only, that's not really a problem.)

What would be the best way to determine what store to use?

(cc @Krinkle @DAlangi_WMF)

Mon, Oct 6, 3:55 PM · Data-Persistence, OKR-Work, MediaWiki-Platform-Team

Eevans added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In T402984#11246154, @isarantopoulos wrote:

I have updated ownership and expiration date
@Eevans There has been a change of plans regarding the integration of this work with this year's Year In Review, so although we still need this Cassandra instance, the request that we have filed for the improve tone structured task in T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task is of higher priority. I just wanted to mention this so you can handle your priorities and timelines accordingly.

Mon, Oct 6, 2:01 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Thu, Oct 2

Eevans updated the task description for T402984: Data Persistence Design Review: Article topic model caching.

Thu, Oct 2, 3:52 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans updated the task description for T402984: Data Persistence Design Review: Article topic model caching.

Thu, Oct 2, 3:48 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans updated the task description for T402984: Data Persistence Design Review: Article topic model caching.

Thu, Oct 2, 3:08 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans updated the task description for T402984: Data Persistence Design Review: Article topic model caching.

Thu, Oct 2, 3:03 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In T402984#11235879, @BWojtowicz-WMF wrote:

[ ... ]

I see you filled out the description with all the discussed details, thank you a lot!

Thu, Oct 2, 1:52 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Wed, Oct 1

Eevans updated the task description for T402984: Data Persistence Design Review: Article topic model caching.

Wed, Oct 1, 11:49 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans triaged T402984: Data Persistence Design Review: Article topic model caching as Medium priority.

Wed, Oct 1, 11:45 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a comment to T401394: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review.

In T401394#11185802, @Ottomata wrote:

Couple more suggestions

Add a question about 'data product owners'.

Who are the data product owners? Every data product should have at least 1 owning team and 2 people listed. At least one of the people listed should be an accountable manager.

Add a question about expiration date:

What is the data product expiration date? After this expiration date, platform maintainers can justify decommissioning this data product and supporting data pipelines. Anyone can update this expiration date at any time with no questions asked. If this expiration date passes, and there are no official product owners, the data product may be deleted.

Wed, Oct 1, 3:15 PM · Data-Persistence

Eevans added a comment to T401394: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review.

In T401394#11185657, @Ottomata wrote:

@Eevans What do you think about including a link to https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#Naming_and_data_type_conventions or maybe just https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#WMF-specific_Conventions in the data modeling part of the design review? These will at least help with consistency for dataset sources within the data lake, but maybe not so much if you are doing data modeling for a MediaWiki table (which has its own schema guidelines).

Wed, Oct 1, 3:14 PM · Data-Persistence

Eevans added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In T402984#11233362, @BWojtowicz-WMF wrote:

In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.

@Eevans I have a small curiosity question regarding RESTBase vs AQS: for our type of real-time, short-transaction processing, should we expect better performance if deployed on RESTBase? In what ways is RESTBase geared towards this OLTP processing, versus the AQS server towards OLAP processing?

Wed, Oct 1, 2:54 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a comment to T403663: Upgrade Envoy to v1.29.12.

The RESTBase cluster has been upgraded to v1.29.12 (sorry for the delay, I was out all last week and missed the message).

Wed, Oct 1, 1:52 PM · Patch-For-Review, collaboration-services, SRE, serviceops, envoy

Tue, Sep 30

Eevans added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In T402984#11228032, @BWojtowicz-WMF wrote:

[ ... ]

I think these are all quite reasonable. 1000 qps might require us to scale up the Data Gateway though!

This performance expectation was linked directly to the Year in Review project, where we expected to process a few hundred queries per second, thus 1000QPS would be a safe choice with some error margin. However, it was decided lately that the article topic model will not be used in the upcoming Year in Review, but possibly in next year's edition. Thus, we will not be hitting 1000QPS anytime soon, but such load would be possible in the future.

Tue, Sep 30, 2:09 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Mon, Sep 29

Eevans added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In T402984#11192876, @BWojtowicz-WMF wrote:

Why do we need Cache

Machine Learning Team decided to add Cache mechanism to our article topic model in order to meet the scale and throughput requirements for Year in Review project. Extensive description of the task and previous discussions on Cache design can be found here: https://phabricator.wikimedia.org/T401778.

[ ... ]

Table Schema

I'm suggesting a composite key consisting of 3 key columns: page_id, lang, model_version. This composite key uniquely identifies topic predictions for a page and allows for efficient point queries.

Column         Type              Key Type       Description
page_id        Text              Partition Key  The ID of the Wikipedia page
lang           Text              Partition Key  Language code for the page (e.g., 'en', 'fr', 'es')
model_version  Text              Partition Key  Version identifier of the article topic model used
predictions    map<text, float>  -              Mapping from topics to their predicted probability score
last_updated   DateTime          -              Timestamp of when this cache entry was last updated

Mon, Sep 29, 11:40 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11223140, @achou wrote:

[ ... ]

Update: Growth team won't be doing the testwiki PoC this quarter, so we don't have an urgent timeline to ingest a one-off dataset to staging Cassandra...

Mon, Sep 29, 9:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Sep 16 2025

Eevans added a project to T401260: Global Editor Metrics - Data Persistence Design Review: Data-Persistence-Design-Review.

Sep 16 2025, 2:46 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Data-Persistence

Eevans added a project to T402984: Data Persistence Design Review: Article topic model caching: Data-Persistence-Design-Review.

Sep 16 2025, 2:44 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a project to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task: Data-Persistence-Design-Review.

Sep 16 2025, 2:44 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Sep 10 2025

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11166489, @BWojtowicz-WMF wrote:

[ ... ]

@Eevans How should we go forward now? We've discussed a few changes to the initial design in this thread - should I post an updated version of the design, which includes all the changes we've discussed? Is there anything else you'd need from us to progress with the deployments?

Sep 10 2025, 1:57 PM · Machine-Learning-Team

Eevans added a comment to T392283: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.

In T392283#11164941, @Ottomata wrote:

Great stuff thank you Aiko!

the ML team plans to use an event-based solution, which is much preferable to a snapshot-based approach (from a meeting last week). We might/want to use Flink, which we haven't worked with before. Therefore, this work needs to be planned properly and the actual work could span an entire quarter.

FWIW, as noted in T401021#11159027, event-based does not necessarily mean stream processing / Flink. You can source the updates from Hive event database tables, e.g. event.mediawiki_page_content_change.v1, and still use Airflow. This might actually be advantageous here, especially if you only want to generate new recommendations daily-ish. You can choose to generate a task for only the latest edit (or change, e.g. a page delete) per page in your time period, potentially reducing the number of updates you need to send to Cassandra.
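
The "latest edit per page in your time period" idea can be sketched roughly as below. This is only an illustration in Python, not the actual pipeline: the flat field names (wiki_id, page_id, rev_id, dt) and the function latest_change_per_page are hypothetical simplifications, and timestamps are assumed to be uniformly formatted ISO 8601 UTC strings so that string comparison matches chronological order.

```python
from typing import Dict, Iterable, List, Tuple


def latest_change_per_page(events: Iterable[dict]) -> List[dict]:
    """Keep only the most recent change event for each (wiki_id, page_id)."""
    latest: Dict[Tuple[str, int], dict] = {}
    for event in events:
        key = (event["wiki_id"], event["page_id"])
        current = latest.get(key)
        # Assumes uniformly formatted ISO 8601 UTC strings, which sort chronologically.
        if current is None or event["dt"] > current["dt"]:
            latest[key] = event
    return list(latest.values())


# Hypothetical events from one batch window.
events = [
    {"wiki_id": "enwiki", "page_id": 1, "rev_id": 10, "dt": "2025-09-09T08:00:00Z"},
    {"wiki_id": "enwiki", "page_id": 1, "rev_id": 11, "dt": "2025-09-09T17:30:00Z"},
    {"wiki_id": "frwiki", "page_id": 1, "rev_id": 7, "dt": "2025-09-09T12:00:00Z"},
]
updates = latest_change_per_page(events)
```

Here three events collapse to two updates, one per page, which is the reduction in Cassandra writes that the comment describes.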

Sep 10 2025, 1:45 PM · OKR-Work, Goal, Machine-Learning-Team

Sep 5 2025

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11151147, @BWojtowicz-WMF wrote:
Thank you for the discussion @Ottomata and @Eevans!

I think I'm leaning more towards storing all predictions under the key of wiki + page_title + model_version and omitting the threshold altogether from the Cache, leaving the prediction filtering to the application level. This indeed sounds to me like a much more flexible approach in the long term, and it also makes the data stored in the Cache easier to understand than using a binning strategy with different thresholds or confidence_probability.

The table schema would look like this:
CREATE TABLE articletopic_cache (
  page_title    text,
  wiki          text,
  model_version text,
  predictions   map<text, float>,  -- Maps 64 topics to predicted probability score
  last_updated  timestamp,
  PRIMARY KEY((wiki, page_title), model_version)
)
My only worry for storing all 64 prediction topic+scores per row was the storage, but I might be unaware of some potential compression possibilities here.
My estimation was that a single entry storing 64 prediction topics+scores would be ~4.5 kilobytes based on the total number of characters. Scaling this out to 65mil articles would sum up to ~270GB of storage with no compression.
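
As a rough check of that estimate, here is a back-of-the-envelope sketch in Python. It assumes 4.5 kilobytes means 4,500 bytes per entry and ignores Cassandra's per-row overhead, replication, and compression, none of which are specified in the comment.

```python
BYTES_PER_ROW = 4_500    # ~4.5 kilobytes per entry (assumed to mean 4,500 bytes)
ARTICLES = 65_000_000    # ~65 million articles

total_bytes = BYTES_PER_ROW * ARTICLES
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_gb:.1f} GB, {total_gib:.1f} GiB")
```

Under these assumptions the total is 292.5 GB in decimal units, or about 272 GiB, which is in line with the ~270GB figure.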

Sep 5 2025, 1:51 PM · Machine-Learning-Team

Sep 4 2025

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11149545, @Ottomata wrote:

[ ... ]

Storing all of the predictions and their corresponding score could be done by keying on wiki + page_id + model_version, or by wiki + page_id + model_version + threshold (see: T401778#11148221). The former would always give you every result...and so would the latter if you omitted the argument entirely, or you could specify it to get a subset that matched a threshold.

Oh! I see. I was expecting ALL 64 predictions to be stored in topics for a single key of wiki_id,page_id,model_version. You are suggesting to store all 64 results keyed by their prediction confidence probability, NOT the user's provided threshold filter.

Sep 4 2025, 6:58 PM · Machine-Learning-Team

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11149206, @Ottomata wrote:

But again... 64 results isn't a lot, so if you want to elide such indexing in favor of late-filtering that's OK too.

If this is okay, then I think it would be a more flexible and useful cache. Right now the default is 0.5, but what if that changes? Or what if another (large) use case for a different threshold emerges?

Sep 4 2025, 6:04 PM · Machine-Learning-Team

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

So to (try to) make this a bit more concrete:

Sep 4 2025, 2:33 PM · Machine-Learning-Team

Eevans updated the language for P82563 (An Untitled Masterwork) from autodetect to sql.

Sep 4 2025, 2:13 PM

Eevans created P82563 (An Untitled Masterwork).

Sep 4 2025, 2:13 PM

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11147040, @BWojtowicz-WMF wrote:
Why is it more versatile?

@Eevans

I'll write down an example of request parameters and prediction we are generating:

Request payload:
{
  "page_title": "Douglas_Adams",
  "lang": "en",
  "threshold": 0.5
}
Predictions generated by our current model version model_version=alloutlinks_202209:
[
  {"topic":"Culture.Media.Media*","score":0.6859594583511353},
  {"topic":"Culture.Biography.Biography*","score":0.5544804334640503},
  {"topic":"Culture.Literature","score":0.5156299471855164}
]
In our current design, those 3 predictions above would be something that we want to store in the above_threshold_predictions column.
However, under the hood our prediction model always generates scores for 64 topics. We only use the threshold value to filter out the low-scoring topics when returning the response to the user. So instead of saving above_threshold_predictions for the combination of (page_title, lang, model_version, threshold), we could store all 64 prediction topics and scores for the combination of (page_title, lang, model_version) and do application-level filtering based on the threshold value.

This approach could be more versatile as we'd store more "complete" data in our Cache, containing all prediction scores and topics, but it would come at the cost of more storage.
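
The application-level filtering described in this comment could look roughly like the sketch below. It is a minimal Python illustration, not the LiftWing implementation: the function name filter_predictions is hypothetical, the cached row is assumed to be a mapping from topic to score, the low-scoring topic is made up for the example, and treating the threshold as inclusive (>=) is an assumption.

```python
from typing import Dict, List


def filter_predictions(all_scores: Dict[str, float], threshold: float) -> List[dict]:
    """Return topics scoring at or above the threshold, highest score first."""
    kept = [
        {"topic": topic, "score": score}
        for topic, score in all_scores.items()
        if score >= threshold  # inclusive comparison is an assumption
    ]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


# One cached row: every topic score stored, regardless of any threshold.
cached_row = {
    "Culture.Media.Media*": 0.6859594583511353,
    "Culture.Biography.Biography*": 0.5544804334640503,
    "Culture.Literature": 0.5156299471855164,
    "Example.Low.Scoring.Topic": 0.12,  # hypothetical entry for illustration
}
result = filter_predictions(cached_row, 0.5)
```

With a threshold of 0.5 this returns the same three topics as the example response, while a different threshold can be served from the same cached row without recomputing or re-caching anything.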

Sep 4 2025, 2:10 PM · Machine-Learning-Team

Sep 3 2025

Eevans added a comment to T392283: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.

In T392283#11143036, @Michael wrote:

In T392283#11141651, @Eevans wrote:

Keep in mind that we do have a staging environment as well (complete with Data Gateway, and a staging Cassandra cluster). If Growth is able to run their POC from there, that could be a good choice as well.

That sounds promising! Is this documented anywhere? I've seen a URL with -staging on https://wikitech.wikimedia.org/wiki/Data_Gateway but not sure what I should do with that.

Sep 3 2025, 8:27 PM · OKR-Work, Goal, Machine-Learning-Team

Sep 2 2025

Eevans added a comment to T392283: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors.

In T392283#11140256, @achou wrote:

In T392283#11136518, @achou wrote:

For this project, ideally we want to launch something beta in Q1. Currently we're looking at a predefined list of article types to generate at least 10K tasks offline (see Analysis work in this comment). A question I have is: How can we build this initial solution while ensuring we're designing a thoughtful and extendable architecture for the long term?

For the beta launch, one idea is to store generated tasks (results from T401968) as a one-off dataset and serve them via Data Gateway. We would use the finalized data model/schema decided in T401021, but wouldn't need to update the tasks. These tasks just serve for this prototype, not for production. This way would give us more time to plan and build the update pipeline, while enabling the Growth team to integrate into their improve tone POC sooner.

@Eevans Would this be viable for the Data Persistence team?

Sep 2 2025, 9:26 PM · OKR-Work, Goal, Machine-Learning-Team

Aug 29 2025

Eevans updated subscribers of T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

Aug 29 2025, 3:41 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11131764, @Michael wrote:

For image-suggestions, we are accessing a URL that is basically <data-gateway>/public/image_suggestions/suggestions/<wikiId>/<articleId>. I think we are likely going to do something very similar for improve-tone, and it should be close to trivial to extend it to include the revision id as well, if that makes it easier for you 👍

Though, I'd rather not touch image-suggestions at this point.

Aug 29 2025, 3:31 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11131169, @BWojtowicz-WMF wrote:

I'm adding a high-level diagram of the Cache design including the backfilling process, interactions with LiftWing and its users.

Aug 29 2025, 3:02 PM · Machine-Learning-Team

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11131149, @BWojtowicz-WMF wrote:

Q: IIUC this is meant to be a 'query cache', rather than a more general purpose prediction cache, yes?

I'm not sure if I fully understand, but I think yes - currently the design stores the response for the combination of query parameters including threshold. This means we store only the predictions+probabilities above the threshold rather than storing all predictions+probabilities.

I'm curious about the threshold part of the key. Is it necessary? Since the threshold param comes from the user's query anyway, could you not just cache the topic prediction in cassandra by (page_title, lang, model_version), and then use the query's threshold param as a filter to decide if a result should be returned?
That should keep the cache smaller too, as you aren't duplicating the same article topic prediction for every requested threshold filter?
(Apologies if I've totally missed something, I'm sure I'm very ignorant!)

I think it's an absolutely valid point! I agree that caching all predictions for a page and doing application-level filtering based on threshold would be a more versatile approach.

Aug 29 2025, 2:58 PM · Machine-Learning-Team

Aug 28 2025

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11129144, @Ottomata wrote:

I think that many (most?) requests for 'derived data storage' are really about maintaining a 'materialized view' of data about a MediaWiki entity. Usually pages, sometimes revisions, and also sometimes users.

Aug 28 2025, 4:29 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11129075, @Ottomata wrote:

I like these ideas too. Q: could we generalize a bit for structured tasks or more generally page related derived metadata storage?

A standard-ish structured task and/or page (and possibly revision?) related cassandra data model. E.g. always keyed by (wiki_id, page_id (, revision?)). Other columns can be figured out.

A standard update pipeline to upsert data in cassandra that is keyed by (wiki_id, page_id, (revision?)). This could be batch or realtime, but either way should probably be based on page change events.

Aug 28 2025, 3:19 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11127977, @achou wrote:

Posting a problem raised by @Eevans regarding the idea of periodically renewing the set of suggestions stored in Cassandra; I have some follow-up questions.

The general problem is that you have a "set" (The Suggestions), and you want to replace that with a new set periodically. So naively, you can delete the first set and then add the new set, but that has a couple of problems.

If the delete equates to "find all the things, and delete them" (ala DELETE * FROM table), it's wildly inefficient. We have examples of this pattern on MariaDB too where it's an enormous source of pain, but it's even worse on Cassandra because it is distributed. You could separately store a list of the primary keys (meaning, store them in another system), and reduce that to "...and delete them", but that's pretty brittle.

Would the cadence/frequency for regeneration of suggestions affect the severity of this problem? For example, would replacing the set every 3 months be more acceptable than doing it weekly?
For the beta experience, we might only have a fixed set of suggestions. From there, we will figure out the cadence for regenerating tasks after we understand their burn rate.

Aug 28 2025, 3:00 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Aug 26 2025

Eevans created T402984: Data Persistence Design Review: Article topic model caching.

Aug 26 2025, 9:37 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Eevans added a comment to T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems.

In T373826#11121000, @Tgr wrote:

In T373826#11119581, @Eevans wrote:

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

Yeah, it happened the last time the updater was enabled, and nothing changed since then. The AbuseFilter issue we fixed was causing bogus writes, and those are mostly gone (we still have a few thousand, not sure why, but that's a drop in the bucket); it didn't affect reads.

SessionManager does a session store read at the beginning of every authenticated session, and these are authenticated sessions; the same would happen if CirrusSearch did API requests in any other way, using OAuth or cookies or whatever. There isn't any way for a session provider to indicate that it doesn't need session store lookups (conceptually, session providers are responsible for session tokens in the web request and response, not for communication with the store; that is uniform across all session types).

It's also strictly speaking not true that the session store is never used meaningfully; there are some very fringe situations in which we store data there and read it back when using NetworkSession. User autocreation errors are cached in the session to avoid repeating potentially expensive user creation attempts on every request.

If we wanted to avoid reads (which, per above, isn't necessarily a good idea), once multi-session backends land, a hacky but simple solution would be to direct all session store reads/writes with a NetworkSession provider to a fake session store (HashBagOStuff). Or maybe a real but very cheap, non-replicated cache like APC.

Aug 26 2025, 7:02 PM · Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, MediaWiki-Platform-Team (Radar), Wikimedia-production-error, CirrusSearch, NetworkSession

Eevans added a comment to T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems.

In T373826#11119600, @dcausse wrote:

In T373826#11119581, @Eevans wrote:

In T373826#11119314, @dcausse wrote:

After deploying: the session write dashboard remains almost empty, sessionstore rps increased, but unless I'm misreading the dashboard, latencies appear to be fine; only the buckets "<1ms" and "1-2.5ms" increased.

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

The user using the NetworkSession auth mechanism can make a lot of requests, see T373826#10141552,

Aug 26 2025, 3:43 PM · Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, MediaWiki-Platform-Team (Radar), Wikimedia-production-error, CirrusSearch, NetworkSession

Eevans added a comment to T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems.

In T373826#11119314, @dcausse wrote:

After deploying: the session write dashboard remains almost empty, sessionstore rps increased, but unless I'm misreading the dashboard, latencies appear to be fine; only the buckets "<1ms" and "1-2.5ms" increased.

Aug 26 2025, 3:09 PM · Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, MediaWiki-Platform-Team (Radar), Wikimedia-production-error, CirrusSearch, NetworkSession

Aug 21 2025

Eevans added a comment to T402346: hw troubleshooting: disk (sdg) errors on ms-be1071.

In T402346#11106641, @MatthewVernon wrote:

[ ... ]
Showing my working:
lshw -C disk says /dev/sdg is bus info: scsi@0:2.5.0.
megacli -ldpdinfo -a0 tells us Target Id: 5 is associated with Enclosure Device ID: 32, Slot Number: 3, and notes the media errors:
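
As an illustration of the first step in that mapping, here is a small Python sketch that splits an lshw bus info string into its numeric parts. It assumes the string follows the usual SCSI host:channel.target.lun order; the helper name `parse_bus_info` and the `ScsiAddress` type are hypothetical, and nothing here queries lshw or megacli.

```python
import re
from typing import NamedTuple


class ScsiAddress(NamedTuple):
    host: int
    channel: int
    target: int
    lun: int


# Assumed layout: scsi@<host>:<channel>.<target>.<lun>
_BUS_INFO = re.compile(r"^scsi@(\d+):(\d+)\.(\d+)\.(\d+)$")


def parse_bus_info(bus_info: str) -> ScsiAddress:
    """Split a bus info string such as 'scsi@0:2.5.0' into integers."""
    match = _BUS_INFO.match(bus_info.strip())
    if match is None:
        raise ValueError(f"unrecognized bus info: {bus_info!r}")
    return ScsiAddress(*(int(part) for part in match.groups()))


addr = parse_bus_info("scsi@0:2.5.0")
print(addr.target)  # 5, the Target Id to look for in the controller output
```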

Aug 21 2025, 2:33 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Eevans added a comment to T361964: Golang-based Cassandra clients do not perform TLS host verification.

@elukey as I recall, you didn't want to go the IP SAN route, is that correct?

Aug 21 2025, 1:14 AM · Data-Engineering, AQS2.0, Cassandra

Eevans updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.

Aug 21 2025, 1:13 AM · Data-Engineering, AQS2.0, Cassandra

Aug 20 2025

Eevans updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.

Aug 20 2025, 8:04 PM · Data-Engineering, AQS2.0, Cassandra

Eevans updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.

Aug 20 2025, 7:58 PM · Data-Engineering, AQS2.0, Cassandra

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11097502, @BWojtowicz-WMF wrote:

Thank you for the quick answers @Eevans! I'll schedule a call for us, where I will share the larger context, but I also think it'll be useful to continue the discussion in this ticket.

For point of clarification, when you refer to staging & production here, are you referring to Cassandra clusters owned by the ML team, or the ones operated by Data Persistence? It seems like you're referring to the former versus the latter, but I was only aware of the ml-cache cluster(s).

I was indeed referring to the former - as I understand it, there already exists a Cassandra deployment, which was created for the ml-cache functionality, but it was deployed only on the ml-staging k8s cluster so far. So this might be the ml-cache cluster, which you have in mind. Perhaps @isarantopoulos or @klausman would have more information about the existing deployment?

Aug 20 2025, 12:11 AM · Machine-Learning-Team

Aug 19 2025

Eevans created T402346: hw troubleshooting: disk (sdg) errors on ms-be1071.

Aug 19 2025, 8:00 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Eevans added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

In T402247#11099594, @andrea.denisse wrote:

In T402247#11099398, @Eevans wrote:

rsyslog is back up and running after clearing the queue (/var/spool/rsyslog/*), which apparently was corrupted.

Strange, I cleared up the queue yesterday but that didn't resolve the issue. Did you do any additional steps?

Aug 19 2025, 6:06 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage

Eevans added a comment to T402247: rsyslog is segfaulting non-stop on ms-be1071.

rsyslog is back up and running after clearing the queue (/var/spool/rsyslog/*), which apparently was corrupted.

Aug 19 2025, 4:53 PM · SRE Observability (FY2025/2026-Q1), Observability-Logging, SRE-swift-storage

Aug 18 2025

Eevans added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

In T401778#11093479, @BWojtowicz-WMF wrote:

Hello @Eevans @Marostegui! In relation to the work described in this ticket, we'd like to use the existing Cassandra deployment on the staging ML cluster to validate our design for the caching mechanism. In order to do that, we would need to create the needed keyspace/table and users in the Cassandra deployment. Once we have run tests and validated the idea in the staging environment, we would like to create a similar deployment in the production cluster.

Aug 18 2025, 3:46 PM · Machine-Learning-Team

Aug 14 2025

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11082998, @achou wrote:

[ ... ]

Development

Write the Oozie job to move data from Hadoop to Cassandra; Verify the output is correct by outputting to plain JSON/test Hive table; The Oozie job will be unable to load into Wikimedia Cloud Cassandra instances (You just have to hope that the loading works)

Write the AQS endpoint, which includes the table schema spec and unit tests

Is this information still current? I'm specifically trying to understand how data moves from Hadoop to Cassandra if the data is generated and stored in Hadoop, and the Growth team wants to read the data stored in Cassandra via Data Gateway. I'm wondering how this process works and also how we can update data in Cassandra.
Could you give me some pointers? Or maybe @Ottomata would know? :)

Aug 14 2025, 8:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans updated the task description for T401877: aqs: Cassandra read timeouts.

Aug 14 2025, 1:34 PM · Cassandra

Eevans updated the task description for T401877: aqs: Cassandra read timeouts.

Aug 14 2025, 1:33 PM · Cassandra

Eevans added a comment to T401877: aqs: Cassandra read timeouts.

The debug logs (text log files located on the Cassandra cluster nodes) currently cover a period spanning from about the middle of May (about May 20) to today (Aug 14). Among them (all nodes) I can find 290 examples, 259 of which are for image_suggestions.suggestions. Of those 259 timeouts, there are only 93 unique wiki/page_id pairs (the partition key).
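
A rough sketch of that kind of tally, in Python. It assumes the timeouts have already been extracted from the debug logs into (table, wiki, page_id) records; the `Timeout` type, the `summarize` helper, and the sample rows are made up for illustration and are not the real log data.

```python
from collections import Counter
from typing import NamedTuple, Sequence, Tuple


class Timeout(NamedTuple):
    table: str
    wiki: str
    page_id: int


def summarize(timeouts: Sequence[Timeout], table: str) -> Tuple[int, int]:
    """Return (timeouts for `table`, unique wiki/page_id pairs among them)."""
    per_table = Counter(t.table for t in timeouts)
    unique_keys = {(t.wiki, t.page_id) for t in timeouts if t.table == table}
    return per_table[table], len(unique_keys)


# Made-up sample records, not taken from the actual logs.
sample = [
    Timeout("image_suggestions.suggestions", "cawiki", 1),
    Timeout("image_suggestions.suggestions", "cawiki", 1),
    Timeout("image_suggestions.suggestions", "enwiki", 2),
    Timeout("other_keyspace.other_table", "enwiki", 3),
]
print(summarize(sample, "image_suggestions.suggestions"))  # (3, 2)
```

Far fewer unique keys than timeouts, as in the counts above, means the same partitions are timing out repeatedly.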

Aug 14 2025, 1:26 AM · Cassandra

Eevans triaged T401877: aqs: Cassandra read timeouts as Medium priority.

Aug 14 2025, 1:12 AM · Cassandra

Eevans created T401877: aqs: Cassandra read timeouts.

Aug 14 2025, 1:12 AM · Cassandra

Eevans added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

The 500s are indeed the result of query read timeouts at the coordinator nodes, and the queries in question all reliably time out even when run from a command shell:

Aug 14 2025, 1:06 AM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Eevans updated the task description for T368096: mediawiki: migrate from image-suggestion to data-gateway.

Aug 14 2025, 12:57 AM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Aug 13 2025

Eevans added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

In T368096#11080846, @Scott_French wrote:
Alright, the first ImageSuggestions job (cawiki) seems to have completed without issue after a typical ~ 10m run duration. No errors reported in logstash on the ImageSuggestions channel other than the No more articles with suggestions found error typically emitted by the very last batch to execute.

The mediawiki-side envoy metrics look reasonable as well - e.g., no influx of non-2xx response codes that one might naively expect if we've somehow messed up the URL path format string.

In all, there are 3 5xx errors buried in there, all of which appear to be due to query timeouts - e.g., from pod/aqs-http-gateway-main-8567569b67-zcgmn:
{"@timestamp":"2025-08-13T00:05:48Z","message":"Operation timed out - received only 1 responses.","client":{"ip":"127.0.0.1","port":"46628"},"log":{"level":"ERROR"},"service":{"name":"data-gateway"},"trace":{"id":"db55fb89a3290ce8a1234400"},"ecs":{"version":"1.11.0"}}
Not quite sure how frequent those kinds of errors are expected to be in practice, but @Eevans might have some intuition.
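
For counting how often these show up, a small Python sketch that classifies a gateway log line like the one above. Matching on an ERROR level plus a message starting with "Operation timed out" is an assumption based on this single example, and `is_read_timeout` is a hypothetical helper name.

```python
import json

# The log line quoted above, split across lines only for readability.
LINE = (
    '{"@timestamp":"2025-08-13T00:05:48Z",'
    '"message":"Operation timed out - received only 1 responses.",'
    '"client":{"ip":"127.0.0.1","port":"46628"},'
    '"log":{"level":"ERROR"},'
    '"service":{"name":"data-gateway"},'
    '"trace":{"id":"db55fb89a3290ce8a1234400"},'
    '"ecs":{"version":"1.11.0"}}'
)


def is_read_timeout(line: str) -> bool:
    """True if the line is an ERROR record whose message reports a timeout."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    log = record.get("log")
    level = log.get("level") if isinstance(log, dict) else None
    message = record.get("message")
    return (
        level == "ERROR"
        and isinstance(message, str)
        and message.startswith("Operation timed out")
    )


print(is_read_timeout(LINE))  # True
```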

Aug 13 2025, 1:38 PM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Aug 12 2025

Eevans updated the task description for T368096: mediawiki: migrate from image-suggestion to data-gateway.

Aug 12 2025, 11:34 PM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Eevans updated the task description for T368096: mediawiki: migrate from image-suggestion to data-gateway.

Aug 12 2025, 11:33 PM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Eevans updated the task description for T368096: mediawiki: migrate from image-suggestion to data-gateway.

Aug 12 2025, 11:33 PM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Aug 7 2025

Eevans merged task T400503: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review into T401394: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review.

Aug 7 2025, 1:49 PM · Data-Persistence

Eevans created T401394: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review.

Aug 7 2025, 1:47 PM · Data-Persistence

Aug 5 2025

Eevans updated the task description for T401260: Global Editor Metrics - Data Persistence Design Review.

Aug 5 2025, 10:02 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Data-Persistence

Eevans triaged T401260: Global Editor Metrics - Data Persistence Design Review as Medium priority.

Aug 5 2025, 10:01 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Data-Persistence

Eevans created T401260: Global Editor Metrics - Data Persistence Design Review.

Aug 5 2025, 10:01 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Data-Persistence

Aug 4 2025

Eevans added a comment to T401127: Swift device facts / names for new JBOD controllers.

Oh good, so it's not just me. :)

Aug 4 2025, 9:27 PM · SRE, SRE-swift-storage

Eevans updated the task description for T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

Aug 4 2025, 4:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

In T401021#11057026, @Michael wrote:

Some first thoughts:

[ ... ]

it is mostly structured data (the JSON model response, some metadata) plus a paragraph of plain text (~parsed wikitext), precise details TBD

Aug 4 2025, 3:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Eevans triaged T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task as Medium priority.

Aug 4 2025, 2:59 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Aug 1 2025

Eevans created T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

Aug 1 2025, 6:30 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Jul 28 2025

Eevans added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

In T368096#11040830, @Michael wrote:

[ ... ]

On our end, we will likely especially be watching two panels during the migration:
This panel shows the latency for the requests to the API at wgGEImageRecommendationServiceUrl
https://grafana.wikimedia.org/d/vGq7hbnMz/special3a-homepage-and-suggested-edits?orgId=1&from=now-7d&to=now&timezone=utc&var-platform=$__all&var-UserImpactHandlerPingLimiter=$__all&var-impactrendermode=$__all&viewPanel=panel-45
If the new data-gateway is substantially faster/slower, then that should be visible here.

And this panel shows the time to process the suggestions returned from the API:
https://grafana.wikimedia.org/d/vGq7hbnMz/special3a-homepage-and-suggested-edits?orgId=1&from=now-7d&to=now&timezone=utc&var-platform=$__all&var-UserImpactHandlerPingLimiter=$__all&var-impactrendermode=$__all&viewPanel=panel-189
So if for some reason the new API does not return suggestions, that would be visible here because the time to process them would collapse.

Jul 28 2025, 7:58 PM · MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra, serviceops

Eevans added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

In T368096#11037418, @Michael wrote:

In T368096#11025294, @Scott_French wrote:

The data-gateway listener is now available (though unused) in production MediaWiki at localhost:6038.

One question that came up while reviewing the current state of configuration:

[...]

Specifically, its configuration appears to statically reference image-suggestion.discovery.wmnet, and those configuration keys don't appear to be overridden where the extension is enabled in mediawiki-config.

Are these keys unused in practice, or is the extension actually side-stepping the service mesh?

Is "statically reference image-suggestion.discovery.wmnet" something that is actually still working (or has ever worked) with the legacy image-suggestion service?