User Details
- User Since
- Oct 7 2014, 4:49 PM (576 w, 1 d)
- Availability
- Available
- LDAP User
- EBernhardson
- MediaWiki User
- EBernhardson (WMF)
Yesterday
Ran relforge reports for adjusting the near_match_weight on commonswiki, along with mediawikiwiki to see if this has different effects in different places. I can't share the full reports as they contain user search queries, but the top level stats are shareable. I'm only including commonswiki here as it was the only interesting one. On mediawiki.org, increasing the near_match weights had almost no effect, suggesting this fix is specific to how commonswiki is organized.
Mon, Oct 20
related slack discussion: https://wikimedia.slack.com/archives/C0975D4NLQY/p1759903593949559
Fri, Oct 17
Thu, Oct 16
Updated .deb is available from gitlab. This should be ready for hand-off to SRE to upload the deb to apt.wikimedia.org and restart the clusters. Once the .deb is available from apt.wikimedia.org we will also need to:
Built a new release and deployed it to Maven Central as 1.3.20-wmf4. For example: https://central.sonatype.com/artifact/org.wikimedia.search.highlighter/cirrus-highlighter-core
An initial, simple, proposal would be to split the text field on section boundaries, and retain the section title as a header. This would mean having duplicates of the headings (in both the headings and text fields), increasing the importance of the heading content, but probably not a big deal.
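A minimal sketch of that split, assuming section boundaries can be detected as wikitext-style `== Heading ==` lines (an assumption; the real text field may mark sections differently). The names `HEADING_RE` and `split_sections` are hypothetical and are not part of Cirrus.

```python
import re

# Assumption: section boundaries look like wikitext headings, e.g. "== History ==".
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

def split_sections(text):
    """Split text on headings; each chunk keeps its section title as a header line."""
    chunks = []
    title = None   # the lead section has no title
    start = 0
    for match in HEADING_RE.finditer(text):
        body = text[start:match.start()].strip()
        if title is not None or body:
            chunks.append(body if title is None else f"{title}\n{body}")
        title = match.group(2)
        start = match.end()
    body = text[start:].strip()
    chunks.append(body if title is None else f"{title}\n{body}")
    return chunks

example = "Intro.\n== History ==\nOld.\n=== Early ===\nOlder."
print(split_sections(example))
```

Each chunk after the lead repeats its heading, which is the duplication of heading content mentioned above.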
it might be convenient if we had some tool that could walk the category graph on wiki, then query the same thing out of blazegraph and compare them. Some quick way to identify where the issues might be.
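A rough sketch of what the comparison half of such a tool could look like. How the subcategories are fetched (from the wiki or from blazegraph) is left as a callback; `walk_category_edges`, `compare_edges` and the callback shape are all hypothetical names for illustration, and the dictionaries below are made-up stand-ins for real data.

```python
from collections import deque

def walk_category_edges(root, get_subcategories):
    """Breadth-first walk from root, returning the set of (parent, child) edges."""
    seen = {root}
    edges = set()
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in get_subcategories(parent):
            edges.add((parent, child))
            if child not in seen:   # guards against cycles in the category graph
                seen.add(child)
                queue.append(child)
    return edges

def compare_edges(wiki_edges, blazegraph_edges):
    """Report edges present on one side but not the other."""
    return {
        "missing_in_blazegraph": sorted(wiki_edges - blazegraph_edges),
        "extra_in_blazegraph": sorted(blazegraph_edges - wiki_edges),
    }

# Stand-in data; a real tool would supply fetchers backed by the wiki and by blazegraph.
wiki = {"A": ["B", "C"], "B": ["C"], "C": []}
graph = {"A": ["B"], "B": ["C"], "C": []}
report = compare_edges(
    walk_category_edges("A", lambda c: wiki.get(c, [])),
    walk_category_edges("A", lambda c: graph.get(c, [])),
)
print(report)
```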
Wed, Oct 15
From our review of the initial reports there is also a bit of surprise around the opening_text language model performing worse than the default language model. One plausible explanation is that there are word patterns seen in queries but not the opening text, only in the title fields. As such it would be interesting to run a follow-up test comparing title+redirect.title vs title+redirect.title+opening_text. For that I've created T407432.
We reviewed the report in the Wednesday meeting, where a report against major spaceless languages was requested. A quick run-through of a report restricted to zhwiki and jawiki finds that the default_1v profile is significantly better than the others, suggesting that the variant does potentially have benefits, but it may depend on the language. As such I've run a batch of reports against the top few wikis by size, and a few selected languages that have unique language features:
I mistakenly posted this to the parent ticket, but it belongs here:
Tue, Oct 14
Preliminary reports. They might become final, but they haven't been reviewed by anyone else yet:
We experimented with this, and a model is available in production (example query), but the results just aren't good enough. Calling this complete without implementing it into mjolnir.
Fri, Oct 10
Mon, Oct 6
Copying comment from merged task:
Fri, Oct 3
Thu, Oct 2
Patch to configure and start the tests was prepped in the earlier patch, shipped the test today. Can turn it off Oct 13.
Something like this should be reasonable, along with the link to the documentation
I suspect the underlying technology is now sufficient to support alphabetical sorts (although we would have to evaluate it to be sure). The main sticking point in Cirrus today is that the way keyword fields work does not allow doc_values to be enabled. We would need to migrate all the existing pseudo-keyword mappings to use normalizers, which then allows us to enable doc_values on appropriate keyword fields. Once the index mapping is in place the new sort is only a few lines of configuration in Cirrus.
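For illustration, this is roughly the shape of an Elasticsearch/OpenSearch keyword field that uses a normalizer and has doc_values enabled, written as Python dicts. The field name `title_sort` and the normalizer name `lowercase_keyword` are hypothetical; the actual Cirrus mappings and normalizer filters would differ.

```python
import json

# Hypothetical index settings: a custom normalizer that lowercases keyword values.
settings = {
    "analysis": {
        "normalizer": {
            "lowercase_keyword": {"type": "custom", "filter": ["lowercase"]}
        }
    }
}

# Hypothetical mapping: a real keyword field, which (unlike an analyzed text
# field) can carry doc_values and therefore be sorted on.
mappings = {
    "properties": {
        "title_sort": {
            "type": "keyword",
            "normalizer": "lowercase_keyword",
            "doc_values": True,
        }
    }
}

# A search body requesting an alphabetical sort on that field.
search_body = {"query": {"match_all": {}}, "sort": [{"title_sort": "asc"}]}

print(json.dumps({"settings": settings, "mappings": mappings}, indent=2))
```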
Tue, Sep 30
Mon, Sep 29
Rough outline of a plan. I expect this will first be worked up in a notebook and evaluated. We should be able to upload the models directly from the notebook to see them operate in prod. I'm not sure if we have a way to call out models by name in a debug manner; we might have to define a prod rescore profile that can access the model variant.
Yeah, let's create a separate ticket, as it will likely involve a few days' work.
Fri, Sep 26
Similarly, should we filter searches in main article namespace only? (though I assume that there are very few queries that are not in main namespace).
An additional difficulty with using mediasearch directly from commons is that file search is against both the local wiki and commons. It would be a change in functionality for it to start only displaying results from commons.
The T403593 subtask is now deployed to production and ready for use. The feature is documented in Help:CirrusSearch.
Tested the keywords in prod, looks to be working as expected. Updated Help:CirrusSearch on mw.org with the proposed documentation from above.
Thu, Sep 25
Wed, Sep 24
If a request is a web request and contains no cookies and contains an offset -> Automated
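The rule above can be sketched as a small predicate. The `Request` shape and its field names are hypothetical, and "contains an offset" is assumed here to mean that an offset parameter is present at all.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Request:
    # Hypothetical request shape, for illustration only.
    is_web: bool
    cookies: dict = field(default_factory=dict)
    offset: Optional[int] = None   # None means no offset parameter was sent

def is_automated(req: Request) -> bool:
    """Web request, no cookies, and an offset present -> Automated."""
    return req.is_web and not req.cookies and req.offset is not None

print(is_automated(Request(is_web=True, offset=20)))
print(is_automated(Request(is_web=True, cookies={"session": "x"}, offset=20)))
```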
Tue, Sep 23
Not having the final confidence intervals was unsatisfying, so I went through and worked it up properly with references for how this is supposed to work within a stratified sample. Notebook has been updated to contain the calculation (please review! I am not an expert here).
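For readers unfamiliar with the idea, here is a minimal sketch of the textbook confidence interval for a proportion under stratified sampling. It is not necessarily what the notebook does: it assumes a normal approximation, ignores the finite population correction, and the function name and input shape are made up for this example.

```python
import math

def stratified_proportion_ci(strata, z=1.96):
    """strata: iterable of (weight, sample_size, successes), weights summing to 1.

    Returns (estimate, lower, upper) using
      p_hat = sum(W_h * p_h)
      var   = sum(W_h^2 * p_h * (1 - p_h) / n_h)
    """
    strata = list(strata)
    estimate = sum(w * (k / n) for w, n, k in strata)
    variance = sum(w ** 2 * (k / n) * (1 - k / n) / n for w, n, k in strata)
    half_width = z * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width

# Made-up strata: two equally weighted strata of 100 samples, 50 hits each.
est, low, high = stratified_proportion_ci([(0.5, 100, 50), (0.5, 100, 50)])
print(est, low, high)
```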
Sep 22 2025
Initial estimate for the week of Sept 8 - 15.
What do we think is the right way forward here? If SRE will be prioritizing implementing a newer method of getting data from hdfs to the public sites in the next month or so then it seems like this could wait around, but if it's uncertain when we will be prioritizing this work it seems reasonable to move forward with the existing puppet bits that invoke hdfs_tools::hdfs_rsync_job
Sep 19 2025
Using the pageid filter we can get an explain that contains only the top three results and the target category:
Sep 18 2025
After further consideration, I remembered that query_clicks_hourly still does not contain mobile web requests, but those will need to be included here. To include mobile web we will need to start the analysis from web requests. This is more tedious as the dataset is quite large, but likely necessary. Will have to see if we can analyze a full week. Due to data sizes we may have to break the analysis up into per-day numbers and aggregate those daily numbers.
I poked around the data a bit and experimented with a few things. I suspect we can do something like:
Sep 16 2025
Proposed Documentation, under the Filters heading:
Sep 15 2025
Seems like we have a decision. T404647 created to run the test.
I can fit hours into here if it's needed, but I do wonder if it will feel a bit awkward with consistently rounding time. What I mean is that >2024 and <2024 round their comparisons to the nearest year, and similarly for months or days. This feels natural (to me, at least) when working with those units. With hours we would have lasteditdate:>now-2h; do we also round that to hours? It feels more natural to me for such short timespans to be rounded to minutes, but that would lack consistency and make the system harder to explain. Not sure what the right approach is, but switching between them isn't too hard.
Sep 12 2025
Sep 11 2025
Looking over the date field docs and testing a few things, it looks like we can fairly easily support the syntax requested above. To me the biggest questions are around localization. As stated in the ticket there is the question of localtime vs UTC. There is also the question of date formats: is "05/04/25" in April, or May? Do we perhaps only accept YYYY, YYYY-MM, and YYYY-MM-DD?
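As a sketch of the stricter option, a parser that accepts only YYYY, YYYY-MM and YYYY-MM-DD and expands each value to the half-open date range it covers might look like the following. The function name is hypothetical, timezones are ignored, and this is not the Cirrus implementation.

```python
import re
from datetime import date, timedelta

DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def parse_date_range(value):
    """Return (start, end_exclusive) for YYYY, YYYY-MM or YYYY-MM-DD."""
    match = DATE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported date format: {value!r}")
    year = int(match.group(1))
    if match.group(2) is None:                      # whole year
        return date(year, 1, 1), date(year + 1, 1, 1)
    month = int(match.group(2))
    if match.group(3) is None:                      # whole month
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        return date(year, month, 1), date(next_year, next_month, 1)
    day = date(year, month, int(match.group(3)))    # single day
    return day, day + timedelta(days=1)

print(parse_date_range("2024"))
print(parse_date_range("2024-12"))
```

An ambiguous value such as "05/04/25" is rejected with a ValueError rather than guessed at.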
Sep 10 2025
Per discussion at the Wednesday meeting I added a couple more profiles and renamed the existing profiles to be more consistent. The names should now consistently be of the format: {profile_name}_{prefix_len}(_variant)?. The numbers are mostly but not directly comparable to above; all the queries were re-run, which gave new latency numbers and different results for some queries. All of the results were re-graded into one of the 5 buckets.
Sep 9 2025
To get an idea of what an appropriate weight would be I ran some stats against an hour of incoming requests. Note that the first search result is considered position 1. Also note that this is not re-running the queries; it is applying custom weights to the scores and re-sorting the results that were provided. The true mean will likely be larger than presented here as galleries are pushed down and new results come into the result list.
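A toy sketch of that kind of offline re-weighting. It assumes each logged result is a (title, score, is_gallery) tuple, that the custom weight is a multiplier applied to gallery scores, and that the statistic of interest is the 1-based position of the first gallery; all of these are assumptions for illustration, and the data is made up.

```python
def rerank(results, gallery_weight):
    """Multiply gallery scores by gallery_weight and re-sort by descending score."""
    rescored = [
        (title, score * gallery_weight if is_gallery else score, is_gallery)
        for title, score, is_gallery in results
    ]
    return sorted(rescored, key=lambda r: r[1], reverse=True)

def first_gallery_position(results):
    """1-based position of the first gallery, or None if there is none."""
    for position, (_, _, is_gallery) in enumerate(results, start=1):
        if is_gallery:
            return position
    return None

# Made-up result list for a single logged query.
logged = [("Gallery A", 10.0, True), ("File:B.jpg", 9.0, False), ("File:C.jpg", 6.0, False)]
for weight in (1.0, 0.8, 0.5):
    print(weight, first_gallery_position(rerank(logged, weight)))
```

Because only the logged results are re-sorted, a gallery can never drop below the end of the original list, which is why the true mean after re-running the queries would likely be larger.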
Sep 8 2025
Trying to put some sort of judgement on the list I came up with the following categories:
Can maybe link https://www.mediawiki.org/wiki/Help:CirrusSearch#Character_Classes which documents most of the new functionality (the rest is also documented on that page, but in a different section).
Sep 5 2025
Reindex has completed on all clusters.
Bit of a first draft, this defines a few new profiles: