Milestone for DPE SRE
Fri, Oct 10
Tue, Oct 7
Re-opening this task since we have had some issues using Ceph on dse-k8s-codfw.
To test the integration, we tried a simple PVC definition as a raw block device.
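For context, a raw-block PVC test of that kind is usually a claim with volumeMode: Block. This is only a sketch; the claim name, size, and the storage class name ceph-rbd are assumptions, not the values actually used on dse-k8s-codfw:

```yaml
# Hypothetical raw block PVC sketch; name, size, and storageClassName are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-test
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block            # request a raw block device instead of a filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd   # assumed name for the Ceph RBD provisioner's class
```

A pod would then attach the claim under spec.volumeDevices (with a devicePath) rather than volumeMounts, since no filesystem is provisioned.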
Wed, Oct 1
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:
- wdqs2017 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2017.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1018.eqiad.wmnet with OS bullseye executed with errors:
- wdqs1018 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1018.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1018.eqiad.wmnet with OS bullseye
Change #1192890 merged by Bking:
[operations/puppet@production] wdqs-scholarly: Add wdqs2016 to load balancer pool
Change #1192890 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] wdqs-scholarly: Add wdqs2016 to load balancer pool
Tue, Sep 30
Mentioned in SAL (#wikimedia-operations) [2025-09-30T20:35:45Z] <bking@deploy2002> Finished deploy [wdqs/wdqs@fea7794]: T405978 (duration: 00m 10s)
Mentioned in SAL (#wikimedia-operations) [2025-09-30T20:35:40Z] <bking@deploy2002> Started deploy [wdqs/wdqs@fea7794]: T405978
Mentioned in SAL (#wikimedia-operations) [2025-09-30T20:33:58Z] <bking@deploy2002> Finished deploy [wdqs/wdqs@fea7794]: T405978 (duration: 00m 20s)
Mentioned in SAL (#wikimedia-operations) [2025-09-30T20:33:44Z] <bking@deploy2002> Started deploy [wdqs/wdqs@fea7794]: T405978
Change #1192626 merged by Bking:
[operations/puppet@production] wdqs: add newly-reimaged hosts as scap targets
Change #1192626 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] wdqs: add newly-reimaged hosts as scap targets
Mentioned in SAL (#wikimedia-operations) [2025-09-30T18:51:11Z] <bking@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Mentioned in SAL (#wikimedia-operations) [2025-09-30T17:58:50Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Mentioned in SAL (#wikimedia-operations) [2025-09-30T17:58:26Z] <bking@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Mentioned in SAL (#wikimedia-operations) [2025-09-30T17:58:21Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Mentioned in SAL (#wikimedia-operations) [2025-09-30T17:57:47Z] <bking@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Mentioned in SAL (#wikimedia-operations) [2025-09-30T17:57:39Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (T405978, transfer scholarly graph to newly-reimaged host) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2016.codfw.wmnet w/ force delete existing files, repooling both afterwards
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye executed with errors:
- wdqs2016 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509301633_bking_3986025_wdqs2016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2016.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye
sudo cookbook sre.hardware.upgrade-firmware -n -c nic wdqs2017.codfw.wmnet is failing with the following error:
File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1072, in run
    failures += self._run_host(hostname)
                ^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1120, in _run_host
    if not self.update_driver(
           ^^^^^^^^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 943, in update_driver
    member = self._get_hw_member(redfish_host, driver_category)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 912, in _get_hw_member
    return self._filter_network(redfish_host, members)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 885, in _filter_network
    if port_data['LinkStatus'].lower() == 'up':
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'
I'm going to try updating its other firmware first and see what happens.
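The traceback suggests some Redfish NIC ports report a null LinkStatus, so the cookbook's .lower() call blows up. A minimal defensive guard could look like this; this is a standalone sketch of the failing check, not the actual cookbook code, and link_is_up is a hypothetical helper name:

```python
# Sketch of a None-safe version of the check that crashed in _filter_network.
# Some Redfish NIC port payloads report "LinkStatus": null, so indexing the
# dict yields None and calling .lower() on it raises AttributeError.

def link_is_up(port_data: dict) -> bool:
    """Return True only when the port explicitly reports an 'up' link."""
    link_status = port_data.get('LinkStatus')  # may be None or absent
    return link_status is not None and link_status.lower() == 'up'

ports = [
    {'LinkStatus': 'Up'},
    {'LinkStatus': None},   # the case that crashed the cookbook
    {},                     # key missing entirely
]
print([link_is_up(p) for p in ports])  # [True, False, False]
```

Treating a missing or null LinkStatus as "not up" lets the firmware update skip such ports instead of aborting the whole run.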
wdqs201[6-7] have failed their reimages multiple times. I'm applying all outstanding firmware updates to both hosts and will try the reimages again after that.
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1002 for host wdqs2016.codfw.wmnet with OS bullseye executed with errors:
- wdqs2016 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509292148_bking_159615_wdqs2016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2016.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Mon, Sep 29
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:
- wdqs2017 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2017.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host wdqs2017.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host wdqs2016.codfw.wmnet with OS bullseye
Change #1191525 merged by Ryan Kemper:
[operations/puppet@production] wdqs: shift old full graph hosts to new roles
Fri, Sep 26
Change #1191695 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Track airflow-wikidata-ops for offboarding
(base) btullis@barracuda:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/spark:3.5.7-2025-09-26-113540-22a5ded19e72b0705bce176096b9becec779cf4e@sha256:d887963947332977d36fee6888aedf101068abce4afac6ba7af0a29a2ead03ce bash
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/spark:3.5.7-2025-09-26-113540-22a5ded19e72b0705bce176096b9becec779cf4e@sha256:d887963947332977d36fee6888aedf101068abce4afac6ba7af0a29a2ead03ce' locally
docker-registry.wikimedia.org/repos/data-engineering/spark@sha256:d887963947332977d36fee6888aedf101068abce4afac6ba7af0a29a2ead03ce: Pulling from repos/data-engineering/spark
d62f11b5abe0: Already exists
2b76c2925be8: Already exists
9b108dfdd561: Pull complete
226e4fecf9f8: Pull complete
d46a2035a8fc: Pull complete
309f0b5071dc: Pull complete
1f7d4d323250: Pull complete
1d886fccdf97: Pull complete
Digest: sha256:d887963947332977d36fee6888aedf101068abce4afac6ba7af0a29a2ead03ce
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/spark@sha256:d887963947332977d36fee6888aedf101068abce4afac6ba7af0a29a2ead03ce
++ id -u
+ myuid=926
++ id -g
+ mygid=926
+ set +e
++ getent passwd 926
+ uidentry=spark:x:926:926::/home/spark:/bin/sh
+ set -e
+ '[' -z spark:x:926:926::/home/spark:/bin/sh ']'
+ '[' -z /usr/lib/jvm/java-8-openjdk-amd64 ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
++ command -v readarray
+ '[' readarray ']'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n /opt/hadoop ']'
+ '[' -z '' ']'
++ /opt/hadoop/bin/hadoop classpath
+ export 'SPARK_DIST_CLASSPATH=/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar'
+ SPARK_DIST_CLASSPATH='/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/var/tmp'
+ case "$1" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ exec /usr/bin/tini -s -- bash
spark@fc8f61613ef5:/var/tmp$
Change #1191137 merged by jenkins-bot:
[operations/deployment-charts@master] Remove our custom spark-operator helm chart
I've created the group with @gmodena as the initial member:
Mentioned in SAL (#wikimedia-operations) [2025-09-26T12:35:43Z] <moritzm> created cn=airflow-wikidata-ops group T405557
Change #1191136 merged by jenkins-bot:
[operations/deployment-charts@master] Remove the existing spark-operator release