BTullis (Ben)
Staff SRE

User Details

User Since
Jun 29 2021, 9:56 AM (225 w, 1 d)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF)

Recent Activity

Yesterday

BTullis moved T353786: Decommission an-launcher1002 from Blocked/Waiting to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.

Moving to in-progress, since all active workload has been migrated to an-launcher1003.

Wed, Oct 22, 5:54 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

This reminds me of T389199 (which ended up not being reproducible, just noting for reference).

Wed, Oct 22, 5:53 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis added a comment to T406579: Deploy FerretDB for GrowthBook.

It's also interesting that not only does GrowthBook mention their support for FerretDB:
https://docs.growthbook.io/self-host/ferretdb

Wed, Oct 22, 5:48 PM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis added a comment to T406579: Deploy FerretDB for GrowthBook.

@BTullis suggested that we could build an image based on trixie, in which the package exists in stable channels: https://packages.debian.org/trixie/postgresql-17-pgvector. That would involve using PG 17 instead of 15, as it is the stable PG version in trixie. We'd also need to take https://github.com/FerretDB/documentdb/releases/download/v0.106.0-ferretdb-2.5.0/deb12-postgresql-17-documentdb-dbgsym_0.106.0.ferretdb.2.5.0_amd64.deb and mirror it into our trixie repos.

Just noting that we would probably want to avoid the package that has dbgsym in its name.
From here: https://docs.ferretdb.io/installation/documentdb/deb/

  • For most use cases, we recommend using the production package (e.g., documentdb.deb).
  • For debugging purposes, use the development package (contains either -dev or -dbgsym suffix e.g., documentdb-dev.deb/documentdb-dbgsym.deb). It includes features that significantly slow down performance and is not recommended for production use.
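
The guidance quoted above can be expressed as a simple filename check. This is a minimal sketch in Python, assuming that development builds can be recognised by a `-dev` or `-dbgsym` marker in the filename, as the FerretDB documentation describes. The function name and the short sample filenames are hypothetical.

```python
import re

# Markers that the FerretDB docs associate with development/debug builds.
# The lookahead avoids matching longer words that merely start with "-dev".
_DEV_MARKERS = re.compile(r"-(dev|dbgsym)(?=[_.]|$)")


def is_production_package(filename: str) -> bool:
    """Return True if the .deb filename carries no -dev or -dbgsym marker."""
    return filename.endswith(".deb") and not _DEV_MARKERS.search(filename)


candidates = [
    "documentdb.deb",         # example production name from the docs
    "documentdb-dev.deb",     # development build
    "documentdb-dbgsym.deb",  # debug-symbol build
]
production = [name for name in candidates if is_production_package(name)]
print(production)
```

A check like this could run before a package is mirrored, so that a debug build is not picked up by accident.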
Wed, Oct 22, 5:45 PM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

Since I had rebuilt version 0.0.39 of conda-analytics, I updated the version on the apt servers.

btullis@apt1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
--2025-10-22 17:20:30--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1094453352 (1.0G) [binary/octet-stream]
Saving to: ‘conda-analytics-0.0.39_amd64.deb’
Wed, Oct 22, 5:25 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis closed T402943: Repeated failures to resolve an-master100[3-4] from an-launcher1002 - resulting in pipeline failures as Resolved.

We have now migrated all of the workload from an-launcher1002 to an-launcher1003, so I think that we can tentatively call this done.

Wed, Oct 22, 5:09 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

We got this working with spark in session mode, using the dbt-core and dbt-spark packages in conda-analytics version 0.0.39.

Wed, Oct 22, 4:48 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

I pushed out the version 0.0.39 package to the test-cluster.

btullis@cumin1003:~$ generate-debdeploy-spec 
<snip>
Wed, Oct 22, 11:35 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

@BTullis wouldn't this approach introduce a discrepancy between what users use on stat boxes and what is run in GitLab CI/CD and eventually in Airflow? The latter two will run in Docker images, and I wonder how different the two installations will end up being.

Wed, Oct 22, 11:30 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis added a comment to T405360: Implement an Airflow operator for moving data from point A to B.

I'm just flagging here an investigation that I looked at as part of T405360.
In T402943#11297764 we can see that we currently use hdfs-rsync with an NFS source (clouddumps1002) and an HDFS target.

Wed, Oct 22, 11:09 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Wikimedia Enterprise - Content Integrity, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Wikimedia Enterprise, Essential-Work
BTullis added a comment to T402943: Repeated failures to resolve an-master100[3-4] from an-launcher1002 - resulting in pipeline failures.

With a bit of investigation, it's clear which jobs are the most network-heavy. It's all of the jobs covered by this patch.

Wed, Oct 22, 10:59 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis added a comment to T402943: Repeated failures to resolve an-master100[3-4] from an-launcher1002 - resulting in pipeline failures.

We have migrated most of the workload to an-launcher1003 and it has been running since yesterday without any errors.
One thing that is interesting is that one of the jobs is already exceeding the network throughput that it would have been able to achieve on an-launcher1002.

https://grafana.wikimedia.org/goto/0ChlpHgDR?orgId=1

Wed, Oct 22, 9:58 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work

Tue, Oct 21

BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

I have built a version 0.0.39 of conda-analytics and added it to apt.wikimedia.org
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/jobs/653391

Tue, Oct 21, 3:43 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis placed T403863: Jupyterhub: Decide on/display escalation paths up for grabs.

I'm removing myself as the active assignee, since I haven't got time to work on this right now.
It should be a relatively easy job to add the escalation details to the login.html fragment, now that we know that the template is rendering.

Tue, Oct 21, 2:01 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review, Essential-Work
BTullis added a comment to T406766: Add dbt related packages to conda-analytics.

I have this patch to conda-analytics for review:
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/59

Tue, Oct 21, 12:18 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis added a comment to T406765: Create a new gitlab repository for use with dbt.

Suggestion: instead of naming the repo 'dbt', which looks a bit like a fork of 'dbt', name it:

data-engineering/dbt-jobs

Or something like that?

Tue, Oct 21, 11:50 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis claimed T406766: Add dbt related packages to conda-analytics.

I'll make a start on this.

Tue, Oct 21, 9:31 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Mon, Oct 20

BTullis triaged T407799: Increase the nginx proxy timeouts in superset to 185 seconds as Medium priority.
Mon, Oct 20, 9:16 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis moved T407799: Increase the nginx proxy timeouts in superset to 185 seconds from Backlog - project to Quick Wins on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Mon, Oct 20, 9:16 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis created T407799: Increase the nginx proxy timeouts in superset to 185 seconds.
Mon, Oct 20, 9:15 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis closed T407609: Upgrade our GrowthBook container image to version 4.1 as Resolved.

This image is now available.

btullis@barracuda:~/wmf/growthbook$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b' locally
2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b: Pulling from repos/data-engineering/growthbook
77a1eeafdb5a: Already exists 
05f6e46ebee1: Already exists 
bc796e87bac2: Pull complete 
b9977baba3dc: Pull complete 
b5e8a58d4622: Pull complete 
84be992ecae9: Pull complete 
f014e1080d8f: Pull complete 
74862184cf26: Pull complete 
Digest: sha256:d0e1d6d6e29d9e893bfd2dfd29553b7a9d32374a139cc9b069bfdfa4f8bb14e9
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b
yarn run v1.22.22
$ wsrun -p 'back-end' -p 'front-end' --no-prefix -c start
$ node dist/server.js
$ next start
  ▲ Next.js 14.2.26
  - Local:        http://localhost:3000
Mon, Oct 20, 5:11 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis closed T407609: Upgrade our GrowthBook container image to version 4.1, a subtask of T405749: [EPIC] Deploy GrowthBook, as Resolved.
Mon, Oct 20, 5:11 PM · OKR-Work, Data-Platform-SRE, Experimentation Lab, Epic
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

I have manually removed 5.1 TB of old dumps from the cephfs volume in T407735#11289038 and I have manually triggered a new run of the sync_wikibase_wikidatawiki_dumps DAG.

Mon, Oct 20, 11:02 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis added a comment to T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume.

This should unblock the publishing of the latest dumps in T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

Mon, Oct 20, 10:59 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis added a comment to T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume.
runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$ du -shc 202507* 202508* 202509*
859M	20250716
1.2G	20250718
229G	20250721
477G	20250728
862M	20250730
1.2G	20250801
477G	20250804
864M	20250806
1.2G	20250808
477G	20250811
869M	20250813
1.2G	20250815
248G	20250819
231G	20250820
1.2G	20250822
478G	20250825
879M	20250827
1.2G	20250829
478G	20250901
881M	20250903
1.3G	20250905
479G	20250908
903M	20250910
1.3G	20250912
479G	20250915
103G	20250916
908M	20250917
103G	20250918
1.3G	20250919
879G	20250922
910M	20250924
3.5G	20250926
0	20250929
5.1T	total

Now I can remove these files.

runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$ rm -rf 202507* 202508* 202509*
runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$
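
The `du -shc` listing above can be cross-checked by summing its per-directory lines. This is a minimal sketch in Python, assuming the default `du -h` behaviour of binary units (powers of 1024). Since `du -h` rounds each value before printing, the sum of the displayed sizes is only an approximation of the real total. The function names are hypothetical.

```python
# Multipliers for the binary suffixes printed by `du -h` (powers of 1024).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_size(text: str) -> int:
    """Convert a human-readable du size such as '859M' or '1.2G' to bytes."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # plain number without a suffix, e.g. '0'


def total_bytes(du_output: str) -> int:
    """Sum the per-directory lines of `du -shc` output, skipping 'total'."""
    total = 0
    for line in du_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] != "total":
            total += parse_du_size(fields[0])
    return total


# Three lines excerpted from the listing above, plus its 'total' line,
# which the function deliberately ignores.
sample = "859M\t20250716\n1.2G\t20250718\n0\t20250929\n5.1T\ttotal\n"
print(total_bytes(sample))
```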
Mon, Oct 20, 10:58 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis added a comment to T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume.

The original retention values for these different database dumps were configured according to this puppet fragment.

$keep_generator=['categoriesrdf:3', 'categoriesrdf/daily:3', 'cirrussearch:2', 'contenttranslation:3', 'growthmentorship:3', 'imageinfo:3', 'machinevision:3', 'mediatitles:3', 'pagetitles:3', 'shorturls:3', 'wikibase/wikidatawiki:3', 'wikibase/commonswiki:3']

The deletion was carried out by the cleanup_old_miscdumps.sh script. It is still used in the same way on the clouddumps servers, but with a greater retention time.
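
The retention list above pairs each dump path with a number. The sketch below shows one way such a spec could be applied, assuming the number is the count of most recent dated directories to keep. That reading is an assumption, and this is not the logic of cleanup_old_miscdumps.sh itself. The function names and the sample directory list are hypothetical.

```python
def parse_keep_spec(entries):
    """Turn ['cirrussearch:2', ...] into {'cirrussearch': 2, ...}."""
    spec = {}
    for entry in entries:
        # Split on the last colon, so paths containing '/' stay intact.
        name, _, count = entry.rpartition(":")
        spec[name] = int(count)
    return spec


def dirs_to_remove(dated_dirs, keep):
    """Return the YYYYMMDD directories that fall outside the newest `keep`."""
    # Only consider names that look like dates; YYYYMMDD sorts chronologically.
    ordered = sorted(d for d in dated_dirs if d.isdigit() and len(d) == 8)
    return ordered[:-keep] if keep > 0 else ordered


keep = parse_keep_spec(["wikibase/wikidatawiki:3", "cirrussearch:2"])
existing = ["20250915", "20250922", "20250929", "20250908", "20250901"]
print(dirs_to_remove(existing, keep["wikibase/wikidatawiki"]))
```

Returning the list of candidates, rather than deleting anything, keeps the selection step easy to review before any removal is carried out.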

Mon, Oct 20, 10:54 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

I have created T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume, which describes the housekeeping issue in detail.

Mon, Oct 20, 10:37 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis moved T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume from Backlog - project to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Mon, Oct 20, 10:36 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis created T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume.
Mon, Oct 20, 10:36 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

Good news! The patch to switch database servers seems to have worked.
The latest runs of mediawiki_commons_mediainfo_json_dump, mediawiki_wikidata_all_rdf_dump, and mediawiki_wikidata_truthy_rdf_dump are all back to their normal duration.

Mon, Oct 20, 10:00 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation

Fri, Oct 17

BTullis added a comment to T405360: Implement an Airflow operator for moving data from point A to B.

As an experiment, I'm going to build a version of sync-utils that has support for hdfs-fuse mounts.

Fri, Oct 17, 5:17 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Wikimedia Enterprise - Content Integrity, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Wikimedia Enterprise, Essential-Work
BTullis moved T407123: Mirror OpenSearch repos from upstream from In Progress to Needs Review on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Fri, Oct 17, 5:14 PM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis updated the task description for T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories.
Fri, Oct 17, 4:15 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis added a comment to T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories.

I have updated the deb in these components.

btullis@apt1002:~$ sudo -i reprepro -C thirdparty/elasticsearch-curator5 includedeb bullseye-wikimedia /srv/wikimedia/pool/thirdparty/opensearch2/e/elasticsearch-curator/elasticsearch-curator_5.8.5-1~wmf5+deb11u1_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C thirdparty/elasticsearch-curator5 includedeb bookworm-wikimedia /srv/wikimedia/pool/thirdparty/opensearch2/e/elasticsearch-curator/elasticsearch-curator_5.8.5-1~wmf5+deb12u1_amd64.deb
Exporting indices...
Deleting files no longer referenced...
Fri, Oct 17, 4:01 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis added a comment to T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories.

Also, one more thing occurs to me, which is that we need to change the way that we install elasticsearch-curator.
At the moment, we install a version that we have copied to the thirdparty/opensearch1 and thirdparty/opensearch2 repos.

Fri, Oct 17, 3:55 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis moved T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories from In Progress to Needs Review on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.

I think that these three patches are all ready for a review now.

Fri, Oct 17, 3:38 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis renamed T406579: Deploy FerretDB for GrowthBook from Deploy FerretDB and GrowthBook to Deploy FerretDB for GrowthBook.
Fri, Oct 17, 9:53 AM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis triaged T406578: Deploy Postgres for Growthbook as High priority.
Fri, Oct 17, 9:48 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis triaged T406579: Deploy FerretDB for GrowthBook as High priority.
Fri, Oct 17, 9:47 AM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis edited projects for T406579: Deploy FerretDB for GrowthBook, added: Data-Platform-SRE (2025.10.17 - 2025.11.07); removed Data-Platform-SRE.
Fri, Oct 17, 9:47 AM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis triaged T405749: [EPIC] Deploy GrowthBook as High priority.
Fri, Oct 17, 9:47 AM · OKR-Work, Data-Platform-SRE, Experimentation Lab, Epic
BTullis claimed T407609: Upgrade our GrowthBook container image to version 4.1.
Fri, Oct 17, 9:46 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis created T407609: Upgrade our GrowthBook container image to version 4.1.
Fri, Oct 17, 9:46 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), OKR-Work, Experimentation Lab
BTullis merged T406767: Add dbt related packages to conda-analytics into T406766: Add dbt related packages to conda-analytics.
Fri, Oct 17, 8:56 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis merged task T406767: Add dbt related packages to conda-analytics into T406766: Add dbt related packages to conda-analytics.
Fri, Oct 17, 8:56 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
BTullis moved T404867: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Fri, Oct 17, 8:55 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work, SRE, serviceops, envoy

Thu, Oct 16

BTullis added a comment to T368098: Dumps generation cause disruption to the production environment.

Hello again. It looks like the wikibase dumps performance issue described in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009 may have returned, since September 25th 2025.
We are currently investigating in T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

Thu, Oct 16, 12:12 PM · DPE-Mediawiki-Content, Epic, Data-Engineering, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Dumps-Generation, SRE
BTullis added a comment to T395881: Set up x1 replication to Wiki Replicas.

I created T407485 to track the work required to add this section to an-redacteddb1001 and set up the initial replication.

Thu, Oct 16, 11:50 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work, Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis merged T407486: Set up x1 replication to an-redacteddb1001 into T407485: Set up x1 replication to an-redacteddb1001.
Thu, Oct 16, 11:39 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis merged task T407486: Set up x1 replication to an-redacteddb1001 into T407485: Set up x1 replication to an-redacteddb1001.
Thu, Oct 16, 11:39 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis moved T407486: Set up x1 replication to an-redacteddb1001 from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Thu, Oct 16, 11:38 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis created T407486: Set up x1 replication to an-redacteddb1001.
Thu, Oct 16, 11:38 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis created T407485: Set up x1 replication to an-redacteddb1001.
Thu, Oct 16, 11:38 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
BTullis added a comment to T405517: Make the shell group analytics-privatedata-users less confusing.

Thanks all for raising this ticket and for your kind feedback so far. I totally agree that:

analytics-privatedata-users is confusing for both applicants, and the SREs that action the applications.

I hope that we can make some quick-win improvements of the docs and processes that will benefit all of these stakeholders.
However, as @MoritzMuehlenhoff and @elukey mentioned, the underlying reason for the granularity in the levels of access is that there is complexity in the underlying systems.

Thu, Oct 16, 10:42 AM · Data-Platform-SRE, SRE
BTullis closed T407411: an-test-master1002 has had Puppet disabled for a month as Resolved.

Oops. Sorry about that. This was my oversight. I have re-enabled it and run puppet, which ran cleanly.

The last Puppet run was at Thu Sep 18 13:16:27 UTC 2025 (40037 minutes ago). Puppet is disabled. btullis-T404871 - btullis
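
The 40037 minutes reported in the message above can be converted into days, and into the time at which the check must have run. This is a small worked example in Python that uses only the two values quoted in the message.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the quoted Puppet status message.
last_run = datetime(2025, 9, 18, 13, 16, 27, tzinfo=timezone.utc)
elapsed_minutes = 40037

# Time at which the "minutes ago" figure was computed.
checked_at = last_run + timedelta(minutes=elapsed_minutes)

# Break the elapsed minutes down into days, hours, and minutes.
days, remainder = divmod(elapsed_minutes, 24 * 60)
hours, minutes = divmod(remainder, 60)

print(checked_at.isoformat())  # 2025-10-16T08:33:27+00:00
print(f"{days} days, {hours} hours, {minutes} minutes")  # 27 days, 19 hours, 17 minutes
```

So Puppet had been disabled for just under four weeks when the message was generated.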
Thu, Oct 16, 8:36 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17)

Tue, Oct 14

BTullis added a comment to T406856: Reduce size of analytics/superset/deploy.git Gerrit repo.

Agreed. Please feel free to delete it from Gerrit. We no longer need it. Thanks @hashar for checking.

Tue, Oct 14, 4:39 PM · Data-Engineering, Release-Engineering-Team, Gerrit
BTullis claimed T403522: Enable access to Airflow for kostajh/iPod OpenSearch project.
Tue, Oct 14, 4:01 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), iPoid-Service (IPoid OpenSearch)
BTullis moved T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories from Backlog - project to In Progress on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Tue, Oct 14, 3:57 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis moved T407165: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31 from Backlog - project to Reported on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Tue, Oct 14, 2:51 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis edited projects for T407165: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31, added: Data-Platform-SRE (2025.09.26 - 2025.10.17); removed Data-Platform-SRE.
Tue, Oct 14, 2:51 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis triaged T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories as High priority.
Tue, Oct 14, 10:17 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis created T407199: Pin opensearch and logstash package versions in puppet to avoid updates when we mirror the upstream repositories.
Tue, Oct 14, 10:16 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis claimed T407123: Mirror OpenSearch repos from upstream.
Tue, Oct 14, 12:08 AM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07)

Mon, Oct 13

BTullis moved T402943: Repeated failures to resolve an-master100[3-4] from an-launcher1002 - resulting in pipeline failures from Blocked/Waiting to In Progress on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 11:06 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis moved T406876: Ensure external access to opensearch-test cluster from Backlog - project to In Progress on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:48 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Patch-For-Review
BTullis moved T406222: Add druid coordinator service to LVS for the druid_public cluster. from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis moved T406587: Repeated reimage failures on WDQS hosts from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work, Wikidata-Query-Service, Wikidata
BTullis moved T406658: Create automation to verify WDQS allowlist operations from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis moved T402306: Simplify the datahub ingestion pipelines from Backlog - project to Backlog - operations on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
BTullis moved T406656: Reimage failed after prompt...is prompt needed? from Backlog - project to Done on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Infrastructure-Foundations, Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), SRE, ops-codfw, DC-Ops
BTullis moved T406371: Bump max mapped task for all Airflow instances to 1200 from Backlog - operations to Quick Wins on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Mon, Oct 13, 10:47 PM · Patch-For-Review, Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis removed a project from T407126: OpenSearch on K8s: build OpenSearch images for latest versions 2x and 3x: Essential-Work.
Mon, Oct 13, 10:46 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis removed a project from T407125: Refactor our OpenSearch chart for upstream version 2.8.0: Essential-Work.
Mon, Oct 13, 10:45 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis removed a project from T405985: OpenSearch on K8s: migrate from 2.7.0 to 2.8.0 version of the chart: Essential-Work.
Mon, Oct 13, 10:45 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis removed a project from T407123: Mirror OpenSearch repos from upstream: Essential-Work.
Mon, Oct 13, 10:45 PM · Patch-For-Review, Data-Platform-SRE (2025.10.17 - 2025.11.07)
BTullis closed T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster, a subtask of T362105: EPIC: OpenSearch on K8s (formerly Mutualized opensearch cluster) - FY25/26 WE4.2.6, as Resolved.
Mon, Oct 13, 10:33 PM · Patch-For-Review, Epic, Data-Platform-SRE
BTullis closed T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster as Resolved.

Being bold and closing this epic.

Mon, Oct 13, 10:33 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic
BTullis added a comment to T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster.

Still to do:

  • read the release notes to check whether there are improvements that make it worth upgrading the Ceph plugin or something else (kernel?)
  • if needed, do the appropriate upgrades (create a subtask for it)
Mon, Oct 13, 10:32 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic
BTullis triaged T407166: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31 as Medium priority.
Mon, Oct 13, 10:27 PM · Data-Platform-SRE
BTullis added a subtask for T341984: Update Kubernetes clusters to 1.31: T407166: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31.
Mon, Oct 13, 10:27 PM · Patch-For-Review, collaboration-services, Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
BTullis added a parent task for T407166: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31: T341984: Update Kubernetes clusters to 1.31.
Mon, Oct 13, 10:27 PM · Data-Platform-SRE
BTullis created T407166: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31.
Mon, Oct 13, 10:23 PM · Data-Platform-SRE
BTullis created T407165: Upgrade the ceph-csi-plugin to the latest release compatible with kubernetes version 1.31.
Mon, Oct 13, 10:23 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis added a comment to T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster.

I removed the namespace that we have been using for tests.

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl delete namespace stevemunene-pvc-tests
namespace "stevemunene-pvc-tests" deleted
Mon, Oct 13, 10:06 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis closed T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster as Resolved.
Mon, Oct 13, 9:32 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis closed T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster, a subtask of T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster, as Resolved.
Mon, Oct 13, 9:32 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic
BTullis triaged T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster as High priority.
Mon, Oct 13, 9:31 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis added a comment to T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster.

And now the filesystem-based rbd volumes are working.
I used a PVC spec like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fs-pvc
  namespace: stevemunene-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd

I used a pod spec like this:

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-fs
  namespace: stevemunene-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /mnt
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fs-pvc
        readOnly: false

I created the PVC.

root@deploy2002:/home/btullis# kubectl -f fs-pvc.yaml apply
persistentvolumeclaim/fs-pvc created

Then I checked that the PV had been provisioned and correctly bound to the PVC.

root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
fs-pvc   Bound    pvc-ca1d1e87-f24e-4e5a-b4b6-c8ea897ee45d   1Gi        RWO            ceph-rbd-ssd   <unset>                 15s

Then I created the pod.

root@deploy2002:/home/btullis# kubectl -f fs-pod.yaml apply
pod/pod-with-fs created

I was able to exec into the pod and verify that the filesystem had been created and mounted correctly.

root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests exec -it pod-with-fs -- bash
Mon, Oct 13, 9:31 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis added a comment to T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster.

The raw disk access via the rbd plugin is now working, too.
Here is my PVC spec.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: stevemunene-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd

Here is the pod spec.

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
  namespace: stevemunene-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc

I was able to create the PVC like this.

root@deploy2002:/home/btullis# kubectl -f raw-block-pvc.yaml apply
persistentvolumeclaim/raw-block-pvc created

I could check that the PV was provisioned and bound correctly to the PVC.

root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
raw-block-pvc   Bound    pvc-575ce61b-1de4-4c8f-961b-1a7f194eca9f   1Gi        RWO            ceph-rbd-ssd   <unset>                 99s

I could then create the pod that gets the device assigned.

root@deploy2002:/home/btullis# kubectl -f raw-block-pod.yaml apply
pod/pod-with-raw-block-volume created

I could then exec into the pod and check that the device is present.

root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests exec -it pod-with-raw-block-volume -- bash
nobody@pod-with-raw-block-volume:/$ ls -l /dev/xvda 
brw-rw---- 1 root disk 252, 0 Oct 13 21:09 /dev/xvda

I can't format the device, because I would have to be either root or a member of the disk group, but that's fine. This is just a test.
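
The reasoning about write access can be spelled out from the `ls -l` fields. This is a rough sketch that evaluates only the classic mode bits (it ignores ACLs and Linux capabilities); `may_write` is a hypothetical helper, and the `nogroup` group for the `nobody` user is an assumption, since the pod spec sets only `runAsUser`.

```shell
# Hypothetical helper: decide from `ls -l` fields whether a user may write.
# usage: may_write MODE OWNER GROUP USER "USER_GROUP1 USER_GROUP2 ..."
may_write() {
  mode=$1 owner=$2 group=$3 user=$4 user_groups=$5
  if [ "$user" = "$owner" ]; then
    # Owner class applies: the write bit is the 3rd character of the mode.
    [ "$(printf %s "$mode" | cut -c3)" = w ]; return
  fi
  for g in $user_groups; do
    if [ "$g" = "$group" ]; then
      # Group class applies: the write bit is the 6th character.
      [ "$(printf %s "$mode" | cut -c6)" = w ]; return
    fi
  done
  # Otherwise the "other" class applies: the write bit is the 9th character.
  [ "$(printf %s "$mode" | cut -c9)" = w ]
}

# Fields taken from the listing above; "nogroup" is an assumed group.
if may_write brw-rw---- root disk nobody "nogroup"; then
  echo "writable"
else
  echo "not writable"
fi
# prints: not writable
```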

Mon, Oct 13, 9:19 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis closed T193473: Add HTTPS support to wdqs-internal service as Resolved.

I believe that this is now complete.

Mon, Oct 13, 8:54 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work, Wikidata, Wikidata-Query-Service
BTullis closed T193473: Add HTTPS support to wdqs-internal service, a subtask of T297555: [epic] Brian's onboarding to the Search Platform team, as Resolved.
Mon, Oct 13, 8:54 PM · Discovery-Search (Current work), Epic
BTullis added a comment to T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster.

I have verified that the cephfs plugin is working.

Mon, Oct 13, 2:46 PM · Patch-For-Review, Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis closed T406985: Bring dse-k8s-worker2003.codfw.wmnet into production as Resolved.

It's a bit difficult to validate that everything is working when we haven't yet got any workload, but it seems OK.
I will resolve for now, but revisit if we have any issues with it down the line.

Mon, Oct 13, 2:10 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis updated the task description for T406985: Bring dse-k8s-worker2003.codfw.wmnet into production.
Mon, Oct 13, 2:08 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17)
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

It's possible that the performance regression is similar to that observed here:

Mon, Oct 13, 10:47 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis added a project to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue): Data-Engineering.
Mon, Oct 13, 9:52 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis added a comment to T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

I'm sorry to say that I have very little positive news to report on this.

Mon, Oct 13, 9:51 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering, Essential-Work, Wikibase Reuse Team, Wikidata data dumps, Wikidata, Dumps-Generation
BTullis closed T406394: Degraded RAID on druid1011 as Resolved.

This is now complete.

root@druid1011:~# cat /proc/mdstat 
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid10 sde2[8] sdc2[2] sdd2[3] sdg2[7] sda2[0] sdb2[1] sdf2[5] sdh2[6]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 1/28 pages [4KB], 65536KB chunk
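
The `[8/8] [UUUUUUUU]` field is what confirms that every member is active. As a minimal sketch, assuming the mdstat layout shown above, the check could be automated like this; `degraded_md_arrays` is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# name of every array whose two member counts in "[n/m]" differ.
degraded_md_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }
    match($0, /\[[0-9]+\/[0-9]+\]/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
      if (n[1] + 0 != n[2] + 0) print name
    }
  '
}

# Example using the output captured above.
degraded=$(degraded_md_arrays <<'EOF'
md0 : active raid10 sde2[8] sdc2[2] sdd2[3] sdg2[7] sda2[0] sdb2[1] sdf2[5] sdh2[6]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 1/28 pages [4KB], 65536KB chunk
EOF
)
[ -z "$degraded" ] && echo "all arrays healthy"
```

On the host itself the input would come from `/proc/mdstat`, for example `degraded_md_arrays < /proc/mdstat`.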
Mon, Oct 13, 9:26 AM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), DC-Ops, SRE, ops-eqiad

Fri, Oct 10

BTullis moved T406394: Degraded RAID on druid1011 from Backlog - project to Blocked/Waiting on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Fri, Oct 10, 3:41 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), DC-Ops, SRE, ops-eqiad
BTullis added a comment to T406394: Degraded RAID on druid1011.

Thanks @Jclark-ctr - Looks good. I can see that the drive showed up as /dev/sde and had no partition table.

[Fri Oct 10 13:14:56 2025] scsi 0:2:4:0: Direct-Access     ATA      SSDSC2KB960G8R   DL63 PQ: 0 ANSI: 6
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: Attached scsi generic sg4 type 0
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] 1875385008 512-byte logical blocks: (960 GB/894 GiB)
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] 4096-byte physical blocks
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Write Protect is off
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Mode Sense: 6b 00 10 08
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Attached SCSI disk
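
The kernel's size figures follow directly from the sector count. A small worked example, using only the numbers in the log above (`sectors_to_bytes` is a hypothetical helper name, and 64-bit shell arithmetic is assumed):

```shell
# Hypothetical helper: convert a sector count and sector size to bytes.
sectors_to_bytes() {
  echo $(( $1 * $2 ))
}

bytes=$(sectors_to_bytes 1875385008 512)
echo "$bytes bytes"                     # 960197124096 bytes
echo "$(( bytes / 1000000000 )) GB"     # 960 GB  (decimal, 10^9 bytes)
echo "$(( bytes / 1073741824 )) GiB"    # 894 GiB (binary, 2^30 bytes)
```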

I checked the other devices.

btullis@druid1011:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda              8:0    0 894.3G  0 disk   
├─sda1           8:1    0   285M  0 part   
└─sda2           8:2    0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdb              8:16   0 894.3G  0 disk   
├─sdb1           8:17   0   285M  0 part   
└─sdb2           8:18   0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdc              8:32   0 894.3G  0 disk   
├─sdc1           8:33   0   285M  0 part   
└─sdc2           8:34   0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdd              8:48   0 894.3G  0 disk   
├─sdd1           8:49   0   285M  0 part   
└─sdd2           8:50   0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sde              8:64   0 894.3G  0 disk   
sdf              8:80   0 894.3G  0 disk   
├─sdf1           8:81   0   285M  0 part   
└─sdf2           8:82   0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdg              8:96   0 894.3G  0 disk   
├─sdg1           8:97   0   285M  0 part   
└─sdg2           8:98   0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdh              8:112  0 894.3G  0 disk   
├─sdh1           8:113  0   285M  0 part   
└─sdh2           8:114  0   894G  0 part   
  └─md0          9:0    0   3.5T  0 raid10 
    ├─vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm    /
    └─vg0-srv  253:2    0   2.7T  0 lvm    /srv
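
In the tree above, /dev/sde is the only disk with no partitions beneath it. As a rough sketch, the same check could be made mechanically, assuming list-style input with NAME, TYPE and PKNAME columns such as `lsblk -ln -o NAME,TYPE,PKNAME` would produce. `unpartitioned_disks` is a hypothetical helper, and the sample input is reconstructed from the tree above rather than captured output.

```shell
# Hypothetical helper: read "NAME TYPE [PKNAME]" rows on stdin and print
# every disk that has no partition listing it as its parent.
unpartitioned_disks() {
  awk '
    $2 == "disk" { order[++n] = $1 }
    $2 == "part" { has_part[$3] = 1 }
    END {
      for (i = 1; i <= n; i++)
        if (!(order[i] in has_part)) print order[i]
    }
  '
}

# Sample rows reconstructed from the tree above (not captured output).
unpartitioned_disks <<'EOF'
sdd disk
sdd1 part sdd
sdd2 part sdd
sde disk
sdf disk
sdf1 part sdf
sdf2 part sdf
EOF
# prints: sde
```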

I took a copy of the partition table from /dev/sdh and applied it to /dev/sde.

root@druid1011:~# sfdisk -d /dev/sdh | sfdisk /dev/sde
Checking that no-one is using this disk right now ... OK
Fri, Oct 10, 3:41 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), DC-Ops, SRE, ops-eqiad