
EBernhardson (EBernhardson)
User

User Details

User Since
Oct 7 2014, 4:49 PM (576 w, 1 d)
Availability
Available
LDAP User
EBernhardson
MediaWiki User
EBernhardson (WMF) [ Global Accounts ]

Recent Activity

Yesterday

EBernhardson added a comment to T406020: Tool for testing different weightings in search results.

Ran relforge reports for adjusting the near_match_weight on commonswiki, along with mediawikiwiki to see whether this has different effects in different places. I can't share the full reports as they contain user search queries, but the top-level stats are shareable. I'm only including commonswiki here as it was the only interesting one: on mediawiki.org, increasing the near_match weights had almost no effect, suggesting this fix is specific to how commonswiki is organized.

Wed, Oct 22, 6:57 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Mon, Oct 20

EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

Related question regarding flow of data, based on the comment from the thread you linked.

The wikitext -> html happens inside the mediawiki application using the default mediawiki parser. I'm not sure what exactly happens under the hood; I expect it's a full PHP parser that runs in-process, but I haven't paid enough attention to exactly what they do. This is indeed quite expensive: we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is because we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16 week cycles.

A html dataset (T360794) is a request for data engineering with a number of use cases, and has been discussed in related phab tasks for years. The linked phab task is for an incremental html dataset, which is "the easier" part of a html dataset and will hopefully get prioritized soon. I have focused on that part to get something off the ground. The more challenging part is creating the html dataset of historical revisions (e.g. render with what mediawiki version, what to do with templates, etc..).

  • do I understand this right that the full re-render loop taking 16 weeks is for the "current" content of all pages (i.e. not historical)? That is indeed a long time.
Mon, Oct 20, 4:55 PM · Discovery-Search
EBernhardson claimed T407514: Ignore MacOS .DS_Store in parent pom.
Mon, Oct 20, 3:27 PM · Discovery-Search (2025.09.26 - 2025.10.17), Java-Scala-Standardization, Essential-Work, Data-Engineering
EBernhardson added a comment to T407514: Ignore MacOS .DS_Store in parent pom.

MR: https://gitlab.wikimedia.org/repos/maven/wmf-jvm-parent-pom/-/merge_requests/27

Mon, Oct 20, 3:27 PM · Discovery-Search (2025.09.26 - 2025.10.17), Java-Scala-Standardization, Essential-Work, Data-Engineering
EBernhardson updated subscribers of T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Mon, Oct 20, 3:20 PM · Discovery-Search
EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

related slack discussion: https://wikimedia.slack.com/archives/C0975D4NLQY/p1759903593949559

Mon, Oct 20, 3:17 PM · Discovery-Search

Fri, Oct 17

EBernhardson claimed T406205: Investigate and cleanup broken weighted_tags in cirrus indices.
Fri, Oct 17, 4:06 PM · MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Patch-For-Review, Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Thu, Oct 16

EBernhardson moved T407520: Deploy cirrus-highlighter plugin to fix surrogate matching from Incoming to Blocked / Waiting on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 16, 7:40 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, CirrusSearch
EBernhardson edited projects for T407520: Deploy cirrus-highlighter plugin to fix surrogate matching, added: Discovery-Search (2025.09.26 - 2025.10.17); removed Discovery-Search.
Thu, Oct 16, 7:39 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, CirrusSearch
EBernhardson updated the task description for T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Thu, Oct 16, 7:31 PM · Discovery-Search
EBernhardson updated the task description for T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Thu, Oct 16, 7:31 PM · Discovery-Search
EBernhardson edited projects for T407520: Deploy cirrus-highlighter plugin to fix surrogate matching, added: Data-Platform-SRE; removed Patch-For-Review.

Updated .deb is available from gitlab. This should be ready for hand-off to SRE to upload the deb to apt.wikimedia.org and restart the clusters. Once the .deb is available from apt.wikimedia.org we will also need to:

Thu, Oct 16, 6:49 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, CirrusSearch
EBernhardson added a comment to T407520: Deploy cirrus-highlighter plugin to fix surrogate matching.

built a new release and deployed to maven central, as 1.3.20-wmf4. For example: https://central.sonatype.com/artifact/org.wikimedia.search.highlighter/cirrus-highlighter-core

Thu, Oct 16, 5:28 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, CirrusSearch
EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

An initial, simple proposal would be to split the text field on section boundaries, and retain the section title as a header. This would mean having duplicates of the headings (in both the headings and text fields), increasing the importance of the heading content, but that is probably not a big deal.
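
The proposal above can be illustrated with a minimal sketch. It splits wikitext-style `== Heading ==` lines purely for illustration and says nothing about how CirrusSearch would actually detect section boundaries; `HEADING` and `split_sections` are hypothetical names.

```python
import re

# Assumption: sections are delimited by wikitext-style headings of level 2-6.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


def split_sections(text):
    """Return a list of section strings, each starting with its heading."""
    matches = list(HEADING.finditer(text))
    sections = []
    # Text before the first heading becomes a heading-less lead section.
    lead = text[: matches[0].start()] if matches else text
    if lead.strip():
        sections.append(lead.strip())
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        heading = m.group(2)
        # Retain the section title as the first line of the section text.
        sections.append(f"{heading}\n{body}" if body else heading)
    return sections
```

Each element of the returned list would correspond to one entry of the array-valued text field, with the heading duplicated at the top of its section.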

Thu, Oct 16, 5:19 PM · Discovery-Search
EBernhardson created T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Thu, Oct 16, 5:05 PM · Discovery-Search
EBernhardson moved T405059: Adapt hasrecommendation to filter by score and possibly rank by score from To be Deployed to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 16, 3:15 PM · Growth-Team, Revise-Tone-Structured-Task, Essential-Work, MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T406920: deepcategory search fails to show all expected results.

it might be convenient if we had some tool that could walk the category graph on wiki, then query the same thing out of blazegraph and compare them. Some quick way to identify where the issues might be.

Thu, Oct 16, 2:59 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch, Commons

Wed, Oct 15

EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

From our review of the initial reports there is also a bit of surprise around the opening_text language model performing worse than the default language model. One plausible explanation is that there are word patterns seen in queries but not the opening text, only in the title fields. As such it would be interesting to run a follow-up test comparing title+redirect.title vs title+redirect.title+opening_text. For that I've created T407432.

Wed, Oct 15, 8:54 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson created T407432: Follow-up AB test of dym language model variants.
Wed, Oct 15, 8:52 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), CirrusSearch
EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

We reviewed the report in the Wednesday meeting, where a report against major spaceless languages was requested. A quick run-through of a report restricted to zhwiki and jawiki finds that the default_1v profile is significantly better than the others, suggesting that the variant does potentially have benefits, but that this may depend on the language. As such I've run a batch of reports against the top few wikis by size, and a few selected languages that have unique language features:

Wed, Oct 15, 8:49 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

I mistakenly posted this to the parent ticket, but it belongs here:

Wed, Oct 15, 6:39 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Tue, Oct 14

EBernhardson added a comment to T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content.

Preliminary reports. They might become final, but they haven't been reviewed by anyone else yet:

Tue, Oct 14, 9:46 PM · MW-1.45-notes (1.45.0-wmf.19; 2025-09-16), Epic, Discovery-Search, CirrusSearch
EBernhardson moved T376026: Update event-producing tools to overwrite `meta.dt` from Needs Review to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Tue, Oct 14, 5:06 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
EBernhardson moved T397367: Drop unneeded empty tables from wikis from Blocked / Waiting to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Tue, Oct 14, 5:06 PM · Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), DBA
EBernhardson added a comment to T405867: MLR: Mine and use negative samples.

We experimented with this, and a model is available in production (example query), but the results just aren't good enough. Calling this complete without implementing it in mjolnir.

Tue, Oct 14, 5:05 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T405867: MLR: Mine and use negative samples from Blocked / Waiting to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Tue, Oct 14, 5:04 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Fri, Oct 10

EBernhardson claimed T406020: Tool for testing different weightings in search results.
Fri, Oct 10, 7:07 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Mon, Oct 6

EBernhardson moved T405867: MLR: Mine and use negative samples from In Progress to Blocked / Waiting on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Mon, Oct 6, 5:49 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T40403: Sortable search results.

Copying comment from merged task:

Mon, Oct 6, 4:32 PM · Patch-For-Review, Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Fri, Oct 3

EBernhardson claimed T404647: AB test did-you-mean query suggester variations.
Fri, Oct 3, 4:47 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T404647: AB test did-you-mean query suggester variations from Ready for Dev to Blocked / Waiting on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Fri, Oct 3, 4:47 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Thu, Oct 2

EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

Patch to configure and start the tests was prepped in the earlier patch, shipped the test today. Can turn it off Oct 13.

Thu, Oct 2, 8:28 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T405059: Adapt hasrecommendation to filter by score and possibly rank by score from Needs Review to To be Deployed on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 2, 8:06 PM · Growth-Team, Revise-Tone-Structured-Task, Essential-Work, MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson claimed T405867: MLR: Mine and use negative samples.
Thu, Oct 2, 8:01 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T404822: Analysis: how many search queries are using natural language vs keywords from Needs Review to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 2, 7:26 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson moved T403865: CirrusSearch SearchAfter implementation may skip documents from To be Deployed to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 2, 6:49 PM · Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Essential-Work, CirrusSearch
EBernhardson added a comment to T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.

Something like this should be reasonable, along with the link to the documentation

Thu, Oct 2, 6:19 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson moved T331029: Option to sort alphabetically in Search API from Feature Requests to needs triage on the Discovery-Search board.

I suspect the underlying technology is now sufficient to support alphabetical sorts (although we would have to evaluate it to be sure). The main sticking point is that the way keyword fields work in Cirrus today doesn't allow doc_values to be enabled. We would need to migrate all the existing pseudo-keyword mappings to use normalizers, which then allows us to enable doc_values on appropriate keyword fields. Once the index mapping is in place, the new sort is only a few lines of configuration in Cirrus.

Thu, Oct 2, 5:47 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T405482: Expand poolcounter heuristics to better capture automated requests from To be Deployed to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Thu, Oct 2, 5:14 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)

Tue, Sep 30

EBernhardson updated the title for P83492 mined hard negatives by relaxed retrieval query 2nd try from mined hard negatives by relaxed retrieval query to mined hard negatives by relaxed retrieval query 2nd try.
Tue, Sep 30, 3:41 PM
EBernhardson created P83492 mined hard negatives by relaxed retrieval query 2nd try.
Tue, Sep 30, 3:36 PM
EBernhardson closed T406047: mined hard negatives by relaxed retrieval query as Declined.
Tue, Sep 30, 3:36 PM
EBernhardson created T406047: mined hard negatives by relaxed retrieval query.
Tue, Sep 30, 3:35 PM
EBernhardson created P83487 mined enwiki hard negatives by result position relaxed query.
Tue, Sep 30, 3:04 PM

Mon, Sep 29

EBernhardson added a comment to T405867: MLR: Mine and use negative samples.

Rough outline of a plan, I expect this will first be worked up in a notebook and evaluated. We should be able to upload the models direct from the notebook to see them operate in prod. I'm not sure if we have a way to call out models by name in a debug manner, we might have to define a prod rescore profile that can access the model variant.

Mon, Sep 29, 5:06 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T401590: Adjust CirrusSearchNamespaceWeights for Commons.

Yeah, let's create a separate ticket, as it will likely involve a few days' work.

Mon, Sep 29, 4:53 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech

Fri, Sep 26

EBernhardson claimed T405482: Expand poolcounter heuristics to better capture automated requests.
Fri, Sep 26, 8:09 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)
EBernhardson moved T402629: Monitor CirrusSearch index failures from In Progress to To be Deployed on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Fri, Sep 26, 8:08 PM · Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work, CirrusSearch
EBernhardson added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

Similarly, should we filter searches in main article namespace only? (though I assume that there are very few queries that are not in main namespace).

Fri, Sep 26, 7:25 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson added a comment to T294079: Interwiki searchresults show very different image results from same search on Commons.

an additional difficulty with using mediasearch directly from commons is that file search is against both the local wiki and commons. It would be a change in functionality for it to start only displaying results from commons.

Fri, Sep 26, 6:45 PM · Discovery-Search, CirrusSearch
EBernhardson moved T403518: Add search filter for time since last edit from Blocked / Waiting to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.

The T403593 subtask is now deployed to production and ready for use. The feature is documented in Help:CirrusSearch.

Fri, Sep 26, 5:53 PM · Discovery-Search (2025.09.05 - 2025.09.26), Growth-Team, CirrusSearch, Revise-Tone-Structured-Task
EBernhardson moved T403865: CirrusSearch SearchAfter implementation may skip documents from Needs Review to To be Deployed on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Fri, Sep 26, 5:49 PM · Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Essential-Work, CirrusSearch
EBernhardson moved T403593: CirrusSearch should allow filtering on page creation and last edit timestamps from Not ready to announce to Announce in next Tech/News on the User-notice board.
Fri, Sep 26, 5:28 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson moved T403593: CirrusSearch should allow filtering on page creation and last edit timestamps from To be Deployed to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.

Tested the keywords in prod, looks to be working as expected. Updated Help:CirrusSearch on mw.org with the proposed documentation from above.

Fri, Sep 26, 5:23 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

@EBernhardson Thanks for putting together the notebook. Looks really good, I appreciate the level of detail with respect to manual verification and having confidence intervals.

  • from what I understand, you operationalize natural language queries as all queries which contain one of the words who|what|where|when|why|how (and later do some additional manual filtering). Could you confirm? I think that approach makes sense and is sufficient to get a rough idea of the order of magnitude.
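
The word-list approach described above could look roughly like the following sketch. It is not the notebook's code: the word-boundary matching, case-insensitivity, and the names `QUESTION_WORDS` and `looks_like_natural_language` are assumptions made for illustration.

```python
import re

# Assumption: match the six question words as whole words, ignoring case.
QUESTION_WORDS = re.compile(r"\b(?:who|what|where|when|why|how)\b", re.IGNORECASE)


def looks_like_natural_language(query):
    """First-pass filter: does the query contain any question word?"""
    return QUESTION_WORDS.search(query) is not None
```

A query such as "Doctor Who" passes this filter even though it is a keyword search, which is the kind of false positive the additional manual filtering mentioned above would need to catch.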
Fri, Sep 26, 4:24 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson moved T405482: Expand poolcounter heuristics to better capture automated requests from Needs Review to To be Deployed on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Fri, Sep 26, 2:10 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)

Thu, Sep 25

EBernhardson added a comment to T401590: Adjust CirrusSearchNamespaceWeights for Commons.

I guess the open question is, should an exact title match somehow go around deboosts?

Intuitively - yes, I think so. Is there any way to confirm that with data?

Thu, Sep 25, 4:51 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech

Wed, Sep 24

EBernhardson moved T405482: Expand poolcounter heuristics to better capture automated requests from Incoming to Needs Review on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Wed, Sep 24, 8:00 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)
EBernhardson added a comment to T405482: Expand poolcounter heuristics to better capture automated requests.

If a request is a web request and contains no cookies and contains an offset -> Automated
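
Read as a predicate, the rule above might be sketched like this. The `Request` class and its fields are hypothetical stand-ins, not CirrusSearch code, and the sketch covers only this one heuristic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    """Hypothetical stand-in for an incoming search request."""
    is_web: bool                              # web UI request, as opposed to API
    cookies: dict = field(default_factory=dict)
    offset: Optional[int] = None              # None when no offset was supplied


def is_automated(req):
    """Web request + no cookies + an offset present -> treat as automated."""
    return req.is_web and not req.cookies and req.offset is not None
```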

Wed, Sep 24, 3:30 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)
EBernhardson created T405482: Expand poolcounter heuristics to better capture automated requests.
Wed, Sep 24, 3:29 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.21; 2025-09-30)

Tue, Sep 23

EBernhardson added a comment to T405360: Implement an Airflow operator for moving data from point A to B.

How big are the individual files we need to move for this?

Tue, Sep 23, 5:51 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Wikimedia Enterprise - Content Integrity, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Wikimedia Enterprise, Essential-Work
EBernhardson added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

Not having the final confidence intervals was unsatisfying, so I went through and worked it up properly with references for how this is supposed to work within a stratified sample. Notebook has been updated to contain the calculation (please review! I am not an expert here).
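
For readers unfamiliar with the technique, here is a minimal sketch of a textbook confidence interval for a proportion estimated from a stratified sample. It is not the notebook's calculation: it assumes a normal approximation, ignores the finite population correction, and `stratified_proportion_ci` is a hypothetical name.

```python
import math


def stratified_proportion_ci(strata, z=1.96):
    """strata: list of (population_size, sample_size, positives) per stratum.

    Returns (estimate, lower, upper) using a normal approximation.
    """
    total = sum(pop for pop, _, _ in strata)
    estimate = 0.0
    variance = 0.0
    for pop, n, positives in strata:
        weight = pop / total          # stratum share of the population
        p = positives / n             # sample proportion within the stratum
        estimate += weight * p
        variance += weight ** 2 * p * (1 - p) / n
    half_width = z * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width
```

Weighting each stratum by its population share is what distinguishes this from pooling all samples together, which would bias the estimate toward over-sampled strata.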

Tue, Sep 23, 5:44 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson added a comment to T405360: Implement an Airflow operator for moving data from point A to B.

Blunderbuss could easily do this for you, with minimal resource usage on the Airflow executor side :-)

Tue, Sep 23, 3:12 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Wikimedia Enterprise - Content Integrity, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Wikimedia Enterprise, Essential-Work

Sep 22 2025

EBernhardson moved T404822: Analysis: how many search queries are using natural language vs keywords from In Progress to Needs Review on the Discovery-Search (2025.09.05 - 2025.09.26) board.

Initial estimate for the week of Sept 8 - 15.

Sep 22 2025, 10:17 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson claimed T404822: Analysis: how many search queries are using natural language vs keywords.
Sep 22 2025, 8:55 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson moved T404822: Analysis: how many search queries are using natural language vs keywords from Incoming to In Progress on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 22 2025, 8:55 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson added a comment to T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.

What do we think is the right way forward here? If SRE will be prioritizing implementing a newer method of getting data from hdfs to the public sites in the next month or so then it seems like this could wait around, but if it's uncertain when we will be prioritizing this work it seems reasonable to move forward with the existing puppet bits that invoke hdfs_tools::hdfs_rsync_job

Sep 22 2025, 7:55 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T403593: CirrusSearch should allow filtering on page creation and last edit timestamps from Needs Review to To be Deployed on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 22 2025, 3:07 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson moved T401590: Adjust CirrusSearchNamespaceWeights for Commons from To be Deployed to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 22 2025, 3:06 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech

Sep 19 2025

EBernhardson added a comment to T401590: Adjust CirrusSearchNamespaceWeights for Commons.

Using the pageid filter we can get an explain that contains only the top three results and the target category:

Sep 19 2025, 5:35 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech

Sep 18 2025

EBernhardson added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

After further consideration, I remembered that query_clicks_hourly still does not contain mobile web requests, but those will need to be included here. To include mobile web we will need to start the analysis from web requests. This is more tedious as the dataset is quite large, but likely necessary. We will have to see whether we can analyze a full week; due to data sizes we may have to break the analysis up into per-day numbers and aggregate those daily numbers.

Sep 18 2025, 6:49 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
EBernhardson added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

I poked around the data a bit and experimented with a few things, i suspect we can do something like:

Sep 18 2025, 5:53 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research

Sep 16 2025

EBernhardson added a comment to T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.

Proposed Documentation, under the Filters heading:

Sep 16 2025, 8:07 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson moved T403593: CirrusSearch should allow filtering on page creation and last edit timestamps from In Progress to Needs Review on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 16 2025, 5:29 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson claimed T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.
Sep 16 2025, 5:29 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Sep 15 2025

EBernhardson updated the task description for T404647: AB test did-you-mean query suggester variations.
Sep 15 2025, 8:38 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan from In Progress to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.

Seems like we have a decision. T404647 created to run the test.

Sep 15 2025, 8:35 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson created T404647: AB test did-you-mean query suggester variations.
Sep 15 2025, 8:35 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.

I can fit hours into here if it's needed, but I do wonder if it will feel a bit awkward with consistently rounding time. What I mean is >2024 and <2024 round their comparisons to the nearest year, and similarly for months or days. This feels natural (to me, at least) when working with those units. With hours we would have lasteditdate:>now-2h; do we also round that to hours? It feels more natural to me for such short timespans to be rounded to minutes, but that would lack consistency and make the system harder to explain. Not sure what the right approach is, but switching between them isn't too hard.

Sep 15 2025, 4:50 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson moved T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan from Incoming to In Progress on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 15 2025, 3:18 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Sep 12 2025

EBernhardson added a comment to T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.
  • Accepted format: (<|<=|>|>=)?(YYYY(-MM(-DD)?)?|now(-\d+[ymd])?)

We, Growth, can make that work, but I wonder if we could add h for hours too? So that we can say <now-24h (or maybe <now-36h if we want to have more slack). Though, if that is too complicated, we can also use <=now-2d for our offset to be sure that we exclude pages edited within the last 24 hours.
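
The accepted format quoted above maps fairly directly onto a regular expression. The sketch below is illustrative only: reading YYYY, MM and DD as fixed-width digit groups is an assumption, the pattern checks shape rather than calendar validity, and `parse_date_filter` is a hypothetical helper.

```python
import re

# Assumption: YYYY/MM/DD are 4/2/2 digits; no range checking is done.
DATE_FILTER = re.compile(
    r"(?P<op><=|>=|<|>)?"
    r"(?P<value>\d{4}(?:-\d{2}(?:-\d{2})?)?|now(?:-\d+[ymd])?)"
)


def parse_date_filter(text):
    """Return (operator or None, value) if text fits the format, else None."""
    m = DATE_FILTER.fullmatch(text)
    if m is None:
        return None
    return m.group("op"), m.group("value")
```

Under the format as quoted, `<=now-2d` is accepted while `<now-24h` is rejected, since only the `y`, `m` and `d` units are listed.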

Sep 12 2025, 10:00 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.

@EBernhardson Thanks for adding these characters to the docs!
Re the \uNNNN matching, do you think it's worth clarifying the term "surrogate pairs" in those docs? I ask as it took me a few minutes to work out that — to search for the equivalent of intitle:/💔/ — I needed to use intitle:/\uD83D\uDC94/ (instead of e.g. just using intitle:/\u1F494/). Or do you think that people using this escape character would generally be expected to know what this would be referring to? (genuine question)
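
For reference, the surrogate pair in the example above can be derived from the code point with the standard UTF-16 arithmetic. This is a minimal sketch; `surrogate_escape` is a hypothetical helper, not part of CirrusSearch.

```python
def surrogate_escape(ch):
    """Return the \\uNNNN escape(s) for one character, as UTF-16 code units."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic Multilingual Plane: a single code unit is enough.
        return f"\\u{cp:04X}"
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return f"\\u{high:04X}\\u{low:04X}"
```

For U+1F494 this yields `\uD83D\uDC94`, matching the escape sequence used in the query above.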

Sep 12 2025, 5:25 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Sep 11 2025

EBernhardson added a comment to T403593: CirrusSearch should allow filtering on page creation and last edit timestamps.

Looking over the date field docs and testing a few things, it looks like we can fairly easily support the syntax requested above. To me the biggest questions are around localization. As stated in the ticket, there is the question of local time vs UTC. There is also the question of date formats: is "05/04/25" in April or May? Do we perhaps only accept YYYY, YYYY-MM, and YYYY-MM-DD?

Sep 11 2025, 7:37 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), User-notice, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Sep 10 2025

EBernhardson added a comment to T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan.

Per discussion at the Wednesday meeting I added a couple more profiles and renamed the existing profiles to be more consistent. The names should now consistently be of the format: {profile_name}_{prefix_len}(_variant)?. The numbers are mostly but not directly comparable to the above: all the queries were re-run, which gave new latency numbers and different results for some queries. All of the results were re-graded into one of the 5 buckets.

Sep 10 2025, 8:04 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson created P83172 (An Untitled Masterwork).
Sep 10 2025, 4:43 PM

Sep 9 2025

EBernhardson added a comment to T401590: Adjust CirrusSearchNamespaceWeights for Commons.

To get an idea of what an appropriate weight would be, I ran some stats against an hour of incoming requests. Note that the first search result is considered position 1. Also note that this is not re-running the queries; it is applying custom weights to the scores and re-sorting the results that were provided. The true mean will likely be larger than presented here, as galleries are pushed down and new results come into the result list.
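
The re-weighting described above can be sketched as follows. This is not the code that produced the stats: treating the weight as a multiplier on the original score is an assumption, and `rerank`, `position_of`, and the tuple layout are hypothetical.

```python
def rerank(results, namespace_weights):
    """Apply per-namespace weights to existing scores and re-sort.

    results: list of (page, namespace, score) tuples as originally returned.
    namespace_weights: dict of namespace -> multiplier (default 1.0).
    """
    rescored = [
        (page, ns, score * namespace_weights.get(ns, 1.0))
        for page, ns, score in results
    ]
    return sorted(rescored, key=lambda r: r[2], reverse=True)


def position_of(results, page):
    """1-based position of page in results, or None if absent."""
    for i, (candidate, _, _) in enumerate(results, start=1):
        if candidate == page:
            return i
    return None
```

Because only the results already returned are re-sorted, a page can never fall below the end of the original list, which is why the true mean position would likely be larger than this approach reports.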

Sep 9 2025, 5:33 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech

Sep 8 2025

EBernhardson added a comment to T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan.

Trying to put some sort of judgement on the list i came up with the following categories:

Sep 8 2025, 7:40 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson added a comment to T317599: Allow ^ and $ in intitle regex search.

Can maybe link https://www.mediawiki.org/wiki/Help:CirrusSearch#Character_Classes which documents most of the new functionality (the rest is also documented on that page, but in a different section).

Sep 8 2025, 5:02 PM · User-notice-archive, Discovery-Search (2025.08.15 - 2025.09.05), CirrusSearch
EBernhardson added a comment to T317599: Allow ^ and $ in intitle regex search.

Thanks! I'll postpone including this until the following edition, partially in case T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries is completed by then and can be announced in the same entry (IIUC, that would be the clearest way to do so?). Plus, we'd need to update the proposed draft-entry to include both that other task and also link to any new documentation(?) about these new features. The current proposed draft-entry is still this:

  • regex search queries now support additional features including start-of-line (^) and end-of-line ($) anchors for the intitle keyword, as well as shorthand character classes for digits (\d), whitespace (\s), and word characters (\w) in both insource and intitle.
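
To show what the anchors and shorthand classes in the draft entry mean, here is a small sketch using Python's `re` module as a stand-in. This is an assumption-laden analogy: Python's `re` is not the regex engine CirrusSearch uses and the two may differ in details; `title_matches` and the sample titles are hypothetical.

```python
import re


def title_matches(pattern, title):
    """True if the pattern matches anywhere in the title (Python `re` semantics)."""
    return re.search(pattern, title) is not None


# ^ anchors at the start of the title, $ at the end.
print(title_matches(r"^List of", "List of rivers"))
print(title_matches(r"^List of", "A List of rivers"))
# \d is a digit, \s is whitespace, \w is a word character.
print(title_matches(r"\d{4}$", "Olympics 2024"))
print(title_matches(r"\s\w+$", "Foo bar"))
```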
Sep 8 2025, 4:52 PM · User-notice-archive, Discovery-Search (2025.08.15 - 2025.09.05), CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Needs Review to Blocked / Waiting on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 8 2025, 3:18 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries from To be Deployed to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 8 2025, 3:18 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Sep 5 2025

EBernhardson added a comment to T402220: Sudachi analysis chain fails on long emoji sequence.

The reindex has completed on all clusters.

Sep 5 2025, 4:20 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, MW-1.45-notes (1.45.0-wmf.16; 2025-08-26)
EBernhardson moved T402220: Sudachi analysis chain fails on long emoji sequence from To be Deployed to Done on the Discovery-Search (2025.09.05 - 2025.09.26) board.
Sep 5 2025, 4:20 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, MW-1.45-notes (1.45.0-wmf.16; 2025-08-26)
EBernhardson added a comment to T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan.

Bit of a first draft, this defines a few new profiles:

Sep 5 2025, 2:12 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson created T403826: Evaluate did-you-mean suggestion variants and decide on an AB test plan.
Sep 5 2025, 2:09 PM · Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson created P82618 (An Untitled Masterwork).
Sep 5 2025, 2:01 PM

Sep 4 2025

EBernhardson moved T401590: Adjust CirrusSearchNamespaceWeights for Commons from Needs Review to To be Deployed on the Discovery-Search (2025.08.15 - 2025.09.05) board.
Sep 4 2025, 6:42 PM · Essential-Work, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch, Community-Tech
EBernhardson moved T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries from Needs Review to To be Deployed on the Discovery-Search (2025.08.15 - 2025.09.05) board.
Sep 4 2025, 6:42 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson updated the task description for T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.
Sep 4 2025, 4:23 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.

(i'm guessing this isn't ready to announce yet given that the patch isn't currently merged?)

Sep 4 2025, 4:21 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch