User Details
- User Since: Jun 29 2021, 9:56 AM (225 w, 1 d)
- Availability: Available
- IRC Nick: btullis
- LDAP User: Btullis
- MediaWiki User: BTullis (WMF)
Yesterday
Moving to in-progress, since all active workload has been migrated to an-launcher1003.
It's also interesting that GrowthBook mentions its support for FerretDB:
https://docs.growthbook.io/self-host/ferretdb
Just noting that we would probably want to avoid the package that has dbgsym in its name.
From here: https://docs.ferretdb.io/installation/documentdb/deb/
- For most use cases, we recommend using the production package (e.g., documentdb.deb).
- For debugging purposes, use the development package (contains either -dev or -dbgsym suffix e.g., documentdb-dev.deb/documentdb-dbgsym.deb). It includes features that significantly slow down performance and is not recommended for production use.
Since I had rebuilt version 0.0.39 of conda-analytics, I updated the version on the apt servers.
btullis@apt1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
--2025-10-22 17:20:30--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1094453352 (1.0G) [binary/octet-stream]
Saving to: ‘conda-analytics-0.0.39_amd64.deb’
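After downloading, the package gets imported into the archive with reprepro. As a rough sketch (the target distribution shown here is an assumption, not the exact command I ran):

# Hypothetical import step, mirroring the reprepro usage shown elsewhere in this log.
# The distribution (bookworm-wikimedia) is an assumption.
sudo -i reprepro includedeb bookworm-wikimedia ~/conda-analytics-0.0.39_amd64.deb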
We have now migrated all of the workload from an-launcher1002 to an-launcher1003, so I think that we can tentatively call this done.
We got this working with spark in session mode, using the dbt-core and dbt-spark packages in conda-analytics version 0.0.39.
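As a rough illustration of how to smoke-test this (hypothetical commands; it assumes the conda-analytics environment is active and a dbt profile using the spark adapter's session connection method is in place):

# Hypothetical smoke test, assuming an activated conda-analytics 0.0.39 environment
# and a dbt profile configured with the spark adapter's "session" connection method.
dbt --version                     # should list dbt-core plus the spark adapter
dbt debug --profiles-dir ~/.dbt   # opens a local Spark session to validate the connection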
I pushed out the version 0.0.39 package to the test-cluster.
btullis@cumin1003:~$ generate-debdeploy-spec <snip>
I'm just flagging here an investigation that I looked at as part of T405360.
In T402943#11297764 we can see that we currently use hdfs-rsync with an NFS source (clouddumps1002) and an HDFS target.
With a bit of investigation, it's clear which jobs are the most network-heavy: they are the jobs covered by this patch.
We have migrated most of the workload to an-launcher1003 and it has been running since yesterday without any errors.
One thing that is interesting is that one of the jobs is already exceeding the network throughput that it would have been able to achieve on an-launcher1002.
https://grafana.wikimedia.org/goto/0ChlpHgDR?orgId=1
Tue, Oct 21
I have built version 0.0.39 of conda-analytics and added it to apt.wikimedia.org.
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/jobs/653391
I'm removing myself as the active assignee, since I haven't got time to work on this right now.
It should be a relatively easy job to add the escalation details to the login.html fragment, now that we know that the template is rendering.
I have this patch to conda-analytics for review:
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/59
I'll make a start on this.
Mon, Oct 20
This image is now available.
btullis@barracuda:~/wmf/growthbook$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b' locally
2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b: Pulling from repos/data-engineering/growthbook
77a1eeafdb5a: Already exists
05f6e46ebee1: Already exists
bc796e87bac2: Pull complete
b9977baba3dc: Pull complete
b5e8a58d4622: Pull complete
84be992ecae9: Pull complete
f014e1080d8f: Pull complete
74862184cf26: Pull complete
Digest: sha256:d0e1d6d6e29d9e893bfd2dfd29553b7a9d32374a139cc9b069bfdfa4f8bb14e9
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b
yarn run v1.22.22
$ wsrun -p 'back-end' -p 'front-end' --no-prefix -c start
$ node dist/server.js
$ next start
  ▲ Next.js 14.2.26
  - Local: http://localhost:3000
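To reach the UI from outside the container, the front-end port shown in the startup log would need to be published. A hypothetical local test (not how it would be deployed for real) would be:

# Hypothetical local test: publish the Next.js front-end port (3000) seen in the startup output.
docker run -it -p 3000:3000 docker-registry.wikimedia.org/repos/data-engineering/growthbook:2025-10-20-163649-7d2ca6af3de86d10c9df30819307ca1cd0830a7b
# then browse to http://localhost:3000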
I have manually removed 5.1 TB of old dumps from the cephfs volume in T407735#11289038 and I have manually triggered a new run of the sync_wikibase_wikidatawiki_dumps DAG.
This should unblock the publishing of the latest dumps in T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).
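For reference, triggering a fresh run from the Airflow CLI looks roughly like this (a sketch only; the exact invocation depends on how the Airflow instance is reached):

# Sketch: trigger a new run of the sync DAG, then list its runs to watch progress.
airflow dags trigger sync_wikibase_wikidatawiki_dumps
airflow dags list-runs -d sync_wikibase_wikidatawiki_dumps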
runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$ du -shc 202507* 202508* 202509*
859M    20250716
1.2G    20250718
229G    20250721
477G    20250728
862M    20250730
1.2G    20250801
477G    20250804
864M    20250806
1.2G    20250808
477G    20250811
869M    20250813
1.2G    20250815
248G    20250819
231G    20250820
1.2G    20250822
478G    20250825
879M    20250827
1.2G    20250829
478G    20250901
881M    20250903
1.3G    20250905
479G    20250908
903M    20250910
1.3G    20250912
479G    20250915
103G    20250916
908M    20250917
103G    20250918
1.3G    20250919
879G    20250922
910M    20250924
3.5G    20250926
0       20250929
5.1T    total
Now I can remove these files.
runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$ rm -rf 202507* 202508* 202509*
runuser@mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki$
The original retention values for these different database dumps were configured according to this puppet fragment.
$keep_generator=['categoriesrdf:3', 'categoriesrdf/daily:3', 'cirrussearch:2', 'contenttranslation:3', 'growthmentorship:3', 'imageinfo:3', 'machinevision:3', 'mediatitles:3', 'pagetitles:3', 'shorturls:3', 'wikibase/wikidatawiki:3', 'wikibase/commonswiki:3']
The deletion was carried out by the cleanup_old_miscdumps.sh script. It is still used in the same way on the clouddumps servers, but with a greater retention time.
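For illustration, the keep-the-newest-N logic that this kind of cleanup implements is roughly the following (a sketch only, not the actual contents of cleanup_old_miscdumps.sh; the path and variable names are illustrative):

# Sketch only: keep the newest $KEEP dated run directories for one dump type, delete the rest.
KEEP=3
DUMPDIR=/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki
ls -d "${DUMPDIR}"/20* | sort | head -n -"${KEEP}" | while read -r olddir; do
    echo "removing ${olddir}"
    rm -rf "${olddir}"
done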
I have created T407735: Configure automatic removal of old 'other' dumps from the cephfs dumps volume, which describes the housekeeping issue in detail.
Good news! The patch to switch database servers seems to have worked.
The latest runs of mediawiki_commons_mediainfo_json_dump and mediawiki_wikidata_all_rdf_dump and mediawiki_wikidata_truthy_rdf_dump are all back to their normal duration.
Fri, Oct 17
As an experiment, I'm going to build a version of sync-utils that has support for hdfs-fuse mounts.
I have updated the deb in these components.
btullis@apt1002:~$ sudo -i reprepro -C thirdparty/elasticsearch-curator5 includedeb bullseye-wikimedia /srv/wikimedia/pool/thirdparty/opensearch2/e/elasticsearch-curator/elasticsearch-curator_5.8.5-1~wmf5+deb11u1_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C thirdparty/elasticsearch-curator5 includedeb bookworm-wikimedia /srv/wikimedia/pool/thirdparty/opensearch2/e/elasticsearch-curator/elasticsearch-curator_5.8.5-1~wmf5+deb12u1_amd64.deb
Exporting indices...
Deleting files no longer referenced...
Also, one more thing occurs to me: we need to change the way that we install elasticsearch-curator.
At the moment, we install a version that we have copied to the thirdparty/opensearch1 and thirdparty/opensearch2 repos.
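Once hosts are pointed at the new thirdparty/elasticsearch-curator5 component instead, the client-side view would be roughly this (a sketch; the sources entry and host setup are assumptions):

# Hypothetical client-side check, assuming the host's apt sources gain the new component:
#   deb http://apt.wikimedia.org/wikimedia bookworm-wikimedia thirdparty/elasticsearch-curator5
sudo apt-get update
apt-cache policy elasticsearch-curator    # should offer 5.8.5-1~wmf5+deb12u1 from the new component
sudo apt-get install elasticsearch-curator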
I think that these three patches are all ready for a review now.
Thu, Oct 16
Hello again. It looks like the wikibase dumps performance issue described in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009 may have returned, since September 25th 2025.
We are currently investigating in T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).
I created T407485 to track the work required to add this section to an-redacteddb1001 and set up the initial replication.
Thanks all for raising this ticket and for your kind feedback so far. I totally agree that:
analytics-privatedata-users is confusing for both applicants and the SREs who action the applications.
I hope that we can make some quick-win improvements of the docs and processes that will benefit all of these stakeholders.
However, as @MoritzMuehlenhoff and @elukey mentioned, the underlying reason for the granularity in the levels of access is that there is complexity in the underlying systems.
Oops. Sorry about that. This was my oversight. I have re-enabled it and run puppet, which ran cleanly.
The last Puppet run was at Thu Sep 18 13:16:27 UTC 2025 (40037 minutes ago). Puppet is disabled. btullis-T404871 - btullis
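For reference, re-enabling and re-running Puppet on the host uses the standard wrapper scripts, roughly as follows (a sketch; the exact handling of the reason string is an assumption, and it has to match the one used to disable):

# Sketch: re-enable Puppet with the matching reason, then trigger a run.
sudo enable-puppet 'btullis-T404871'
sudo run-puppet-agent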
Tue, Oct 14
Agreed. Please feel free to delete it from Gerrit. We no longer need it. Thanks @hashar for checking.
Mon, Oct 13
Being bold and closing this epic.
I removed the namespace that we have been using for tests.
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl delete namespace stevemunene-pvc-tests
namespace "stevemunene-pvc-tests" deleted
And now the filesystem-based rbd volumes are working.
I used a PVC spec like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fs-pvc
  namespace: stevemunene-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
I used a pod spec like this:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-fs
  namespace: stevemunene-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /mnt
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fs-pvc
        readOnly: false
I created the PVC.
root@deploy2002:/home/btullis# kubectl -f fs-pvc.yaml apply
persistentvolumeclaim/fs-pvc created
Then I checked that the PV had been provisioned and correctly bound to the PVC.
root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
fs-pvc   Bound    pvc-ca1d1e87-f24e-4e5a-b4b6-c8ea897ee45d   1Gi        RWO            ceph-rbd-ssd   <unset>                 15s
Then I created the pod.
root@deploy2002:/home/btullis# kubectl -f fs-pod.yaml apply
pod/pod-with-fs created
I was able to exec into the pod and verify that the filesystem had been created and mounted correctly.
root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests exec -it pod-with-fs -- bash
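From inside the pod, the kind of checks I mean are along these lines (a sketch, not the exact commands run):

# Sketch: confirm the PVC-backed filesystem is mounted at /mnt inside the pod.
df -h /mnt       # capacity should reflect the 1Gi request
findmnt /mnt     # shows the backing /dev/rbd* device and the filesystem type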
The raw disk access via the rbd plugin is now working, too.
Here is my PVC spec.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
  namespace: stevemunene-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
Here is the pod spec.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
  namespace: stevemunene-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc
I was able to create the pvc like this.
root@deploy2002:/home/btullis# kubectl -f raw-block-pvc.yaml apply
persistentvolumeclaim/raw-block-pvc created
I could check that the PV was provisioned and bound correctly to the PVC.
root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
raw-block-pvc   Bound    pvc-575ce61b-1de4-4c8f-961b-1a7f194eca9f   1Gi        RWO            ceph-rbd-ssd   <unset>                 99s
I could then create the pod that gets the device assigned.
root@deploy2002:/home/btullis# kubectl -f raw-block-pod.yaml apply
pod/pod-with-raw-block-volume created
I could then exec into the pod and check that the device is present.
root@deploy2002:/home/btullis# kubectl -n stevemunene-pvc-tests exec -it pod-with-raw-block-volume -- bash
nobody@pod-with-raw-block-volume:/$ ls -l /dev/xvda
brw-rw---- 1 root disk 252, 0 Oct 13 21:09 /dev/xvda
I can't format the device, because I would have to be either root or a member of the disk group, but that's fine. This is just a test.
I believe that this is now complete.
I have verified that the cephfs plugin is working.
It's a bit difficult to validate that everything is working when we haven't yet got any workload, but it seems OK.
I will resolve for now, but revisit if we have any issues with it down the line.
It's possible that the performance regression is similar to that observed here:
I'm sorry to say that I have very little positive news to report on this.
This is now complete.
root@druid1011:~# cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md0 : active raid10 sde2[8] sdc2[2] sdd2[3] sdg2[7] sda2[0] sdb2[1] sdf2[5] sdh2[6]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 1/28 pages [4KB], 65536KB chunk
Fri, Oct 10
Thanks @Jclark-ctr - Looks good. I can see that the drive showed up as /dev/sde and had no partition table.
[Fri Oct 10 13:14:56 2025] scsi 0:2:4:0: Direct-Access     ATA      SSDSC2KB960G8R   DL63 PQ: 0 ANSI: 6
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: Attached scsi generic sg4 type 0
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] 1875385008 512-byte logical blocks: (960 GB/894 GiB)
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] 4096-byte physical blocks
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Write Protect is off
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Mode Sense: 6b 00 10 08
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Fri Oct 10 13:14:56 2025] sd 0:2:4:0: [sde] Attached SCSI disk
I checked the other devices.
btullis@druid1011:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE    MOUNTPOINT
sda              8:0    0 894.3G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sdb              8:16   0 894.3G  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sdc              8:32   0 894.3G  0 disk
├─sdc1           8:33   0   285M  0 part
└─sdc2           8:34   0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sdd              8:48   0 894.3G  0 disk
├─sdd1           8:49   0   285M  0 part
└─sdd2           8:50   0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sde              8:64   0 894.3G  0 disk
sdf              8:80   0 894.3G  0 disk
├─sdf1           8:81   0   285M  0 part
└─sdf2           8:82   0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sdg              8:96   0 894.3G  0 disk
├─sdg1           8:97   0   285M  0 part
└─sdg2           8:98   0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
sdh              8:112  0 894.3G  0 disk
├─sdh1           8:113  0   285M  0 part
└─sdh2           8:114  0   894G  0 part
  └─md0          9:0    0   3.5T  0 raid10
    ├─vg0-swap 253:0    0   976M  0 lvm     [SWAP]
    ├─vg0-root 253:1    0  74.5G  0 lvm     /
    └─vg0-srv  253:2    0   2.7T  0 lvm     /srv
I took a copy of the partition table from /dev/sdh and applied it to /dev/sde
root@druid1011:~# sfdisk -d /dev/sdh | sfdisk /dev/sde
Checking that no-one is using this disk right now ... OK
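The remaining step, not captured in this log, would be along these lines: re-add the new partition to the array and watch the resync (device names taken from the lsblk output above).

# Sketch of the follow-up: add the replacement partition back into the RAID10 array.
mdadm /dev/md0 --add /dev/sde2
cat /proc/mdstat   # shows the rebuild progress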