Volans (Riccardo Coccioli)
SRE

User Details

User Since
Feb 10 2016, 11:25 AM (506 w, 19 h)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Tue, Oct 21

Restricted Application added a project to T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved: Infrastructure-Foundations.

For some related historical context on the lack of parity between the Icinga and AM APIs in Spicerack see also T293209 (look for optimal).

Tue, Oct 21, 2:06 PM · Infrastructure-Foundations, Spicerack, SRE-tools, Traffic, Observability-Alerting

Fri, Oct 10

Volans added a comment to T250415: Homer: add parallelization support.

[note for future self] If we can wait for Python 3.14 to be around in our systems then we should evaluate also the new InterpreterPoolExecutor for this, it might be a good fit.

Fri, Oct 10, 9:17 AM · User-Elukey, Infrastructure-Foundations, SRE-tools, homer

Mon, Oct 6

Volans added a comment to T393692: transfer.py fails when handling nftables-configured firewall.

You can use SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp [OPTIONS] cumin1002.eqiad.wmnet:/path/... /path/... from cumin1003 for example.

Mon, Oct 6, 2:58 PM · Patch-For-Review, database-backups, Infrastructure-Foundations
Volans edited P83603 Test custom fact from Gerrit patch.
Mon, Oct 6, 10:54 AM
Volans created P83603 Test custom fact from Gerrit patch.
Mon, Oct 6, 10:53 AM

Wed, Sep 24

Volans closed T405434: PuppetFailure Puppet has failed on cloudcumin1001:9100 as Resolved.

Transient failure of git pull for the cloud/wmcs-cookbooks repository, self-resolved at the next puppet run.

Wed, Sep 24, 7:40 AM · cloud-services-team

Tue, Sep 23

Volans added a comment to T393600: sre.discovery cookbooks: refactor use of resolve_with_client_ip.

@Scott_French sorry for the trouble. The patch that added the timeout to the cookbook's version of the function landed ~1.5 years after the functionality was added to Spicerack, and somehow we failed to double-check that the two were functionally equivalent when migrating the cookbook to Spicerack's module function. Sorry about that.
I think we can set a reasonable default timeout without the need to make it tunable, at least as a first fix that could go in at any time.
I doubt we'll have special needs for specific timeouts though, we're talking about DNS queries, not HTTP requests 😉

Tue, Sep 23, 6:40 PM · serviceops

Sep 22 2025

Volans added a watcher for tools-infrastructure-team: Volans.
Sep 22 2025, 6:29 AM
Volans added a member for tools-infrastructure-team: Volans.
Sep 22 2025, 6:29 AM

Sep 11 2025

Volans added a comment to T404373: Log DNS queries from Cloud VPS clients.

Ideally sampled logs would be good enough, depending on how complex the setup to sample them is.
If there are no easy options for real sampling, we could also consider alternative approaches:

  • a poor man's sampling, playing with log rotation and retention (e.g. rotate often and keep only 1 block every N rotated ones)
  • a size-based retention that limits the total size of the logs to a predictable amount (will not help with issues in the past but I guess that most issues where we need the data are live/ongoing)
  • deduplicate the logs to increase the signal in the logs
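
A rough sketch of the first two options, only to make the idea concrete. The function names, the input shapes and the value of N are hypothetical, and a real setup would more likely rely on the log rotation tooling already in place.

```python
from typing import List, Tuple


def keep_one_in_n(rotated: List[str], n: int) -> List[str]:
    """Poor man's sampling: keep 1 rotated file every n.

    The input is assumed to be already ordered (e.g. oldest first).
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return [name for index, name in enumerate(rotated) if index % n == 0]


def enforce_size_budget(files: List[Tuple[str, int]], max_bytes: int) -> List[str]:
    """Size-based retention: keep the newest files that fit in max_bytes.

    files is a list of (name, size_in_bytes) ordered from newest to oldest.
    """
    kept: List[str] = []
    total = 0
    for name, size in files:
        if total + size > max_bytes:
            break  # Everything older than this point would be dropped.
        kept.append(name)
        total += size
    return kept
```

With five rotated files and N=2 the first helper keeps the 1st, 3rd and 5th; the second helper makes the total disk usage predictable at the cost of losing older data.
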
Sep 11 2025, 4:21 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Volans added a comment to T404300: Remove KernelErrors alerts.

+1 for me

Sep 11 2025, 9:06 AM · cloud-services-team (FY2025/26-Q1)
Volans added a comment to T404282: KernelErrors Server cloudcephosd1041 logged kernel errors.

The problem is that there is no evidence in hardware logs and I doubt we'll get any replacement from Dell without them.

Sep 11 2025, 8:39 AM · cloud-services-team
Volans triaged T404282: KernelErrors Server cloudcephosd1041 logged kernel errors as Medium priority.
Sep 11 2025, 7:52 AM · cloud-services-team
Volans added a comment to T404282: KernelErrors Server cloudcephosd1041 logged kernel errors.

I've found this in kern.log/dmesg but nothing in racadm logs (both getsel and lclog):

Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786152] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786154] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786155] {1}[Hardware Error]: event severity: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]:  Error 0, type: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]:  fru_text: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]:   section_type: memory error
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]:   error_status: 0x0000000000000400
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786158] {1}[Hardware Error]:   physical_address: 0x0000000f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786159] {1}[Hardware Error]:   node: 1 card: 0 module: 0 rank: 2 bank: 18 device: 6 row: 53592 column: 448
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786160] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786161] {1}[Hardware Error]:   DIMM location:  B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786168] [Firmware Warn]: GHES: Invalid error status block length!
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786185] soft_offline: 0xf9ec05: invalidated
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786192] mce: [Hardware Error]: Machine check events logged
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786194] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 255: 940000000000009f
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.794205] mce: [Hardware Error]: TSC 0 ADDR f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.799695] mce: [Hardware Error]: PROCESSOR 0:806f8 TIME 1757558012 SOCKET 0 APIC 0 microcode 2b000639
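
For reference, a small sketch of how the relevant fields could be pulled out of lines like the ones above. The function name and the choice of fields are assumptions made for illustration; the regular expressions only target the exact wording visible in this excerpt.

```python
import re
from typing import Dict, Iterable, Optional

# One pattern per field of interest, matching the wording of the log above.
PATTERNS = {
    "severity": re.compile(r"\[Hardware Error\]:\s+event severity:\s+(\S+)"),
    "error_type": re.compile(r"\[Hardware Error\]:\s+error_type:\s+\d+,\s+(.+)$"),
    "dimm": re.compile(r"\[Hardware Error\]:\s+DIMM location:\s+(\S+)"),
}


def summarize_ghes(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the first severity, error type and DIMM location found."""
    summary: Dict[str, Optional[str]] = {key: None for key in PATTERNS}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match and summary[key] is None:
                summary[key] = match.group(1).strip()
    return summary
```
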
Sep 11 2025, 7:52 AM · cloud-services-team

Sep 10 2025

Volans added a comment to T404163: CodeSearch is unresponsive (2025-09-10).

Restart completed

Sep 10 2025, 9:13 AM · VPS-project-Codesearch
Volans added a comment to T404163: CodeSearch is unresponsive (2025-09-10).

Both ssh and the VM console hang, without giving any prompt. Forcing a VM restart.

Sep 10 2025, 8:54 AM · VPS-project-Codesearch
Volans claimed T404163: CodeSearch is unresponsive (2025-09-10).

I'm looking into it

Sep 10 2025, 8:51 AM · VPS-project-Codesearch
Volans added a watcher for cloud-services-team: Volans.
Sep 10 2025, 8:05 AM
Volans added a member for cloud-services-team: Volans.
Sep 10 2025, 8:05 AM

Sep 8 2025

Volans closed T378331: Puppet module hiera_lookup not working as Resolved.

This might have been related to the migration to puppet7 and the new puppetdb hosts probably. I can't recall. Resolving as it cannot be reproduced right now AFAICT, feel free to re-open if that's not the case.

Sep 8 2025, 3:05 PM · Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T378331: Puppet module hiera_lookup not working.

It works fine for me:

>>> p.hiera_lookup('cumin1003.eqiad.wmnet', 'profile::puppet::agent::force_puppet7')
DRY-RUN: Executing commands ['puppet lookup --render-as s --compile --node cumin1003.eqiad.wmnet profile::puppet::agent::force_puppet7 2>/dev/null'] on 1 hosts: puppetserver1001.eqiad.wmnet
'true'
Sep 8 2025, 2:57 PM · Infrastructure-Foundations, SRE-tools, Spicerack

Aug 28 2025

Volans added a comment to T403153: Upgrade cloudcumin hosts to bookworm/trixie.

+1 for me to upgrade them to bookworm for simplicity and to be in sync with the cumin hosts.

Aug 28 2025, 12:26 PM · Cloud-VPS, cloud-services-team

Jul 21 2025

Volans added a comment to T388874: Update Kubernetes library version in spicerack.

Quick update, the cumin hosts are now on bookworm where python3-kubernetes is on v22.6.0, but we still have cumin1002 around on bullseye until the DBA-stuff has been all made compatible with bookworm, see T389380.

Jul 21 2025, 3:54 PM · serviceops, Datacenter-Switchover
Volans added a comment to T399449: decommission db1246.eqiad.wmnet.

@Marostegui sorry I didn't understand that the old host was already unracked or otherwise unreachable also on the management side and I thought my earlier reply was already covering the questions, my bad.

Jul 21 2025, 9:04 AM · SRE, DC-Ops, ops-eqiad, DBA, decommission-hardware

Jul 16 2025

Volans closed T341973: Spicerack: add distributed locking support as Resolved.
Jul 16 2025, 10:47 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Jul 14 2025

Volans added a comment to T397687: Increase the default batch size of puppet.run().

@JMeybohm do you have a specific use case that cannot be solved, or is hard to solve, by simply changing the batch_size of the call to puppet.run()?
https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.run

Jul 14 2025, 2:52 PM · Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T399449: decommission db1246.eqiad.wmnet.

If the new host has a new hostname I think the usual decom template can be used and followed. I don't see any blocker there, if the host is not up and running or the disks were removed the only thing skipped by the decom cookbook will be the disk wipe.

Jul 14 2025, 12:29 PM · SRE, DC-Ops, ops-eqiad, DBA, decommission-hardware

Jul 9 2025

Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Are we sure that the network card is properly installed? I'm getting this from racadm:

Jul 9 2025, 4:33 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
Volans added a comment to T399069: Proposal: adding a kafka admin client to spicerack.

An immediate workaround was implemented in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1167593 that also gives us an idea of what could be useful to expose from spicerack: either a plain KafkaAdminClient just pre-configured with the connection to the current cluster, or some more advanced wrapper to extract specific information.
As a starter, probably just exposing the KafkaAdminClient could be enough, but I'll leave the decision on this to the kafka experts.

Jul 9 2025, 1:40 PM · Data-Platform-SRE (2025.07.26 - 2025.08.15), Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a project to T398444: More frequent Puppet runs on the alert hosts?: SRE-tools.

I wonder if the prometheus servers have a similar behavior of applying changes from puppet exported resources.

Jul 9 2025, 9:42 AM · Infrastructure-Foundations, SRE-tools, SRE Observability (FY2025/2026-Q1)

Jul 3 2025

Volans added a comment to T398464: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs.

Ack, let's do both: disable it in the bios and skip it in the import

Jul 3 2025, 10:23 AM · Infrastructure-Foundations, SRE
Volans created T398605: Prometheus puppettization has a very large directory.
Jul 3 2025, 10:19 AM · Observability-Metrics, observability
Volans added a comment to T398412: Decom cookbook: delete virtual interfaces from device.

Option 2 LGTM too

Jul 3 2025, 10:16 AM · Patch-For-Review, netbox, netops, Infrastructure-Foundations, SRE
Volans added a comment to T398464: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs.

Totally agree there is no point. For the idrac one the only potential use case would be to match it with our existing mgmt but probably not.

Jul 3 2025, 10:16 AM · Infrastructure-Foundations, SRE

Jul 2 2025

Volans triaged T397696: I/F hackathon June 2025: Add kubernetes support to Debmonitor as Medium priority.
Jul 2 2025, 9:54 AM · Patch-For-Review, Infrastructure-Foundations

Jun 27 2025

Volans updated subscribers of T396396: decommission cloudcontrol2004-dev.codfw.wmnet.
Jun 27 2025, 2:07 PM · SRE, DC-Ops, ops-codfw, decommission-hardware, cloud-services-team
Volans added a comment to T397868: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002.

There are currently changes to remove:

frpig2001.mgmt.frack.codfw.wmnet
pay-lvs2001.mgmt.frack.codfw.wmnet
pay-lvs2002.mgmt.frack.codfw.wmnet
pay-lvs2001.frack.codfw.wmnet
pay-lvs2002.frack.codfw.wmnet
frban1001.mgmt.frack.eqiad.wmnet
frpig2001.frack.codfw.wmnet
frban1001.frack.eqiad.wmnet

and their related reverse PTR records.
Are those ok to be removed from the live DNS?

Jun 27 2025, 7:13 AM · SRE, DC-Ops, ops-codfw, decommission-hardware, fundraising-tech-ops
Volans added a comment to T397868: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002.

When editing netbox DNS records please always make sure to run the sre.dns.netbox cookbook as otherwise there are pending changes that will block other users and trigger icinga alerts.

Jun 27 2025, 7:11 AM · SRE, DC-Ops, ops-codfw, decommission-hardware, fundraising-tech-ops

Jun 26 2025

Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

But I've tested the scp_dump that was failing and it's fixed. So I think the provision should work. Feel free to try it (from cumin2002 that has the latest version, I will update the others as soon as other testing on other changes is completed)

Jun 26 2025, 9:12 AM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

@Jhancock.wm thanks, I've run the provision cookbook on cp2044 but it is giving me an authentication credential error.

Jun 26 2025, 9:10 AM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Jun 25 2025

Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

By trial and error with Luca we found that the Target parameter wants a list now. Sent new fix.
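
A minimal sketch of that kind of adjustment, assuming the request body carries Target inside a dictionary of share parameters. The helper name and the payload shape are illustrative assumptions, not the actual fix that was sent.

```python
from typing import Dict, List, Union


def build_share_parameters(target: Union[str, List[str]] = "ALL") -> Dict[str, List[str]]:
    """Return the share parameters with Target always expressed as a list."""
    # Accept the old scalar form too, so existing callers keep working.
    targets = [target] if isinstance(target, str) else list(target)
    return {"Target": targets}
```

Normalizing in one place means callers can keep passing a plain string while the request always carries the list form.
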

Jun 25 2025, 2:05 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

No way, it doesn't work yet, but I need to understand why:

>>> import xml.etree.ElementTree as ET
>>> from xml.dom import minidom
>>> schema = ET.fromstring(r.request('get', '/redfish/v1/Schemas/OemManager_v1.xml').text)
>>> print(ET.tostring(schema).decode())
[...SNIP...]
        <ns1:Property Name="Target" Type="Collection(Edm.String)" Nullable="false">
          <ns1:Annotation Term="OData.Permissions" EnumMember="OData.Permissions/Read" />
          <ns1:Annotation Term="OData.Description" String="To identify the component for Export. It identifies the one or more FQDDs. Default = ALL." />
          <ns1:Annotation Term="OData.LongDescription" String="This property shall indicate the component(s) for Export. The list of valid values for Target is ALL, IDRAC, BIOS, NIC, RAID, FC, InfiniBand, SupportAssist, EventFilters, System, LifecycleController, AHCI, PCIeSSD etc. FQDD strings are also supported as valid values. Default value is ALL. The action ImportSystemConfigurationPreview accepts only the default value ALL." />
        </ns1:Property>
[...SNIP...]

Am I looking at the wrong schema?
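
The same check can be scripted instead of reading the full dump. This is a sketch under the assumption that the schema is a CSDL document where properties are Property elements with Name and Type attributes, as in the excerpt above; the SAMPLE document and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, heavily reduced stand-in for the real schema document.
SAMPLE = """<Schema xmlns="http://docs.oasis-open.org/odata/ns/edm" Namespace="Example">
  <ComplexType Name="ShareParameters">
    <Property Name="Target" Type="Collection(Edm.String)" Nullable="false" />
  </ComplexType>
</Schema>"""


def property_type(xml_text: str, name: str) -> Optional[str]:
    """Return the Type attribute of the first Property called name, if any."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Compare the local tag name only, so the namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "Property" and element.get("Name") == name:
            return element.get("Type")
    return None
```
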

Jun 25 2025, 12:34 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Jun 24 2025

Volans added a comment to T397696: I/F hackathon June 2025: Add kubernetes support to Debmonitor.

As a reminder to ourselves, to run django manage.py commands you need to:

$ sudo -i
$ export DJANGO_SETTINGS_MODULE=debmonitor.settings.prod
$ export DEBMONITOR_CONFIG=/etc/debmonitor/config.json
$ python3 /usr/lib/python3/dist-packages/debmonitor/manage.py
Jun 24 2025, 3:49 PM · Patch-For-Review, Infrastructure-Foundations
Volans added a comment to P78665 (An Untitled Masterwork).

Possible alternate format slightly more structured to ensure we read the right keys instead of almost free form.

{
  "cluster": "clustername",
  "images": [
    {
      "name": "docker-registry.discovery.wmnet/cert-manager/cainjector:1.10.1-2",
      "namespaces": {
          "cert-manager": 2,
           ....
      }
    } 
  ]
}
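
A sketch of how the receiving side could check that structure before using it. The function name and the error messages are hypothetical, and the key names are taken from the example above.

```python
from typing import Any, Dict


def validate_report(report: Dict[str, Any]) -> None:
    """Raise ValueError if the report does not match the structured format."""
    if not isinstance(report.get("cluster"), str):
        raise ValueError("'cluster' must be a string")
    images = report.get("images")
    if not isinstance(images, list):
        raise ValueError("'images' must be a list")
    for image in images:
        if not isinstance(image, dict) or not isinstance(image.get("name"), str):
            raise ValueError("each image must be an object with a string 'name'")
        namespaces = image.get("namespaces")
        if not isinstance(namespaces, dict):
            raise ValueError("'namespaces' must map namespace names to counts")
        for namespace, count in namespaces.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(count, bool) or not isinstance(count, int) or count < 0:
                raise ValueError(f"invalid count for namespace {namespace!r}")
```
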
Jun 24 2025, 3:25 PM
Volans added a comment to P78665 (An Untitled Masterwork).

This is from localhost real code executed:

image.png (812×2 px, 169 KB)

Jun 24 2025, 1:17 PM
Volans added a comment to P78665 (An Untitled Masterwork).

Would something like this be ok on the debmonitor side?

image.png (482×2 px, 123 KB)

Jun 24 2025, 10:23 AM

Jun 23 2025

Volans added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Those new servers are of generation 17, the first one shipped with iDRAC 10 and a firmware version of 1.20.x.x.
Its Redfish support is slightly different from the previous iDRACs and, according to [1], also not yet fully developed.
Looking at the manual [2] and the SCP specific reference guide [3] and based on a quick test on cp2043, I think that just adding the "ShareType" parameter should be enough.

Jun 23 2025, 5:36 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Jun 19 2025

Volans added a comment to T397300: Upgrade Netbox to version 4.0.11.

It's present twice in the dump:

$ zgrep wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte /srv/psql-all-dbs-latest.sql.gz
-- Name: wireless_wirelesslink wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte; Type: FK CONSTRAINT; Schema: public; Owner: netbox
    ADD CONSTRAINT wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte FOREIGN KEY (interface_a_id) REFERENCES public.dcim_interface(id) DEFERRABLE INITIALLY DEFERRED;
-- Name: wireless_wirelesslink wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte; Type: FK CONSTRAINT; Schema: public; Owner: netbox
    ADD CONSTRAINT wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte FOREIGN KEY (interface_a_id) REFERENCES public.dcim_interface(id) DEFERRABLE INITIALLY DEFERRED;
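
A quick way to look for other duplicates of this kind in the same dump, as a sketch. It assumes the ADD CONSTRAINT line format shown above and, as a simplification, ignores the table each constraint belongs to; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

_CONSTRAINT_RE = re.compile(r"^\s*ADD CONSTRAINT\s+(\S+)")


def duplicated_constraints(lines: Iterable[str]) -> List[str]:
    """Return the constraint names that appear more than once."""
    counts = Counter(
        match.group(1) for match in map(_CONSTRAINT_RE.match, lines) if match
    )
    return sorted(name for name, count in counts.items() if count > 1)
```
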
Jun 19 2025, 7:12 AM · Infrastructure-Foundations, netbox

Jun 18 2025

Volans added a comment to T389380: Upgrade Cumin hosts to Bookworm.

I had to deploy homer to cumin2002 after the upgrade:

sudo cookbook sre.deploy.python-code -r 'Release v0.10.1' homer 'cumin2002*'

Now it works fine.

Jun 18 2025, 8:45 PM · Patch-For-Review, Infrastructure-Foundations
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

Yes, if you pick the same version (option 0 above) it would just tell you that there is nothing to do because it is already at the same version. Thanks for checking.

Jun 18 2025, 4:13 PM · SRE-tools, Infrastructure-Foundations, DC-Ops
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

The cookbook exited with that code because it had a failure; unfortunately it was missing a useful logging message at the right point. I'm adding it in this patch.
Manually checking the disks, they both show the correct version, so I'm not 100% sure what happened there, but it could be that they were not (yet?) reporting the new version. If you try to re-run it, it does tell you there is nothing to upgrade, right?

Jun 18 2025, 11:55 AM · SRE-tools, Infrastructure-Foundations, DC-Ops
Volans added a comment to T397306: Sync firmwares directory between the cumin hosts.

That's an interesting idea that would work right now because the auto-download from the Dell website is broken, but if we fix that then any cumin host where the cookbook runs will download a new file if asked. I guess we need to decide if we want to work on resurrecting that workflow or not (with all the caveats it has).

Jun 18 2025, 10:49 AM · DC-Ops, SRE-tools, Infrastructure-Foundations
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

Created T397306

Jun 18 2025, 10:37 AM · SRE-tools, Infrastructure-Foundations, DC-Ops
Volans triaged T397306: Sync firmwares directory between the cumin hosts as Medium priority.
Jun 18 2025, 10:37 AM · DC-Ops, SRE-tools, Infrastructure-Foundations
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

@BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in /srv/firmware/poweredge-r440/SSD. If you use that path it should just work.

Jun 18 2025, 10:32 AM · SRE-tools, Infrastructure-Foundations, DC-Ops

Jun 16 2025

Volans added a comment to T396940: cloudcephosd1xxxx.private.eqiad.wikimedia.cloud.

Why is reimaging messing with those addresses at all? @Volans says that it's because of syncing with puppetdb, but I don't see evidence that those addresses were ever referenced in puppet.

Jun 16 2025, 6:44 AM · cloud-services-team, Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
Volans closed T244315: decommission cookbook: add support for decom spreadsheet as Resolved.

Sounds good! Resolving this, happy to discuss further improvements whenever you want.

Jun 16 2025, 6:26 AM · Infrastructure-Foundations, SRE-tools

Jun 12 2025

Volans closed T379757: Q2:rack/setup/install db224[12] as Resolved.

Changes applied, I had to also run the sudo cookbook sre.dns.wipe-cache db2241.mgmt.codfw.wmnet db2242.mgmt.codfw.wmnet to make sure I was connecting to the same one.

Jun 12 2025, 4:53 PM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans updated subscribers of T379757: Q2:rack/setup/install db224[12].

Thanks, I'm checking with @wiki_willy too for the accounting side before proceeding to be sure.

Jun 12 2025, 4:05 PM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans added a comment to T396717: Fix PXE miss-configurations.

Running the provision cookbook (with the appropriate options for an existing host) might or might not trigger a host reboot based on which configurations are changed. So it might depend on the current status of the host and how much it differs from the status the provision cookbook expects.

Jun 12 2025, 3:44 PM · SRE, ops-eqiad, ops-codfw, DC-Ops
Volans added a comment to T379757: Q2:rack/setup/install db224[12].

I think that this has been a case of serial number swap and it probably "worked" because both hosts were set up at the same time, maybe.

Jun 12 2025, 11:22 AM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans reopened T379757: Q2:rack/setup/install db224[12] as "Open".

FYI db2241 and db2242 have their MGMT DNS inverted, so db2241.mgmt.codfw.wmnet points to db2242 iDRAC and vice versa.
This is very dangerous as operating on one host will actually perform changes on the other, like a reboot, and it happens that db2241 is a master right now.

Jun 12 2025, 9:13 AM · SRE, Data-Persistence, ops-codfw, DC-Ops
Volans edited P77792 PXE NIC MAC address retrieval audit.
Jun 12 2025, 9:02 AM
Volans edited P77792 PXE NIC MAC address retrieval audit.
Jun 12 2025, 8:59 AM
Volans edited P77792 PXE NIC MAC address retrieval audit.
Jun 12 2025, 8:19 AM
Volans added a comment to T396712: Evaluate automatic MAC-based DHCP for production servers.

I've run some custom code with spicerack-shell to get the audit data for the whole fleet, comparing the MAC address retrieved from Redfish with the one in PuppetDB for the primary interface.
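
The comparison step of such an audit can be sketched as follows. This is not the code that was actually run: the function names and the input shape (one mapping of hostname to MAC address per source) are assumptions, and fetching the data from Redfish and PuppetDB is left out.

```python
from typing import Dict, List, Tuple


def normalize_mac(mac: str) -> str:
    """Lowercase the address and use colons as separators."""
    return mac.strip().lower().replace("-", ":")


def mac_mismatches(
    redfish: Dict[str, str], puppetdb: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (host, redfish_mac, puppetdb_mac) for hosts where they differ."""
    mismatches = []
    # Only hosts present in both sources can be compared.
    for host in sorted(redfish.keys() & puppetdb.keys()):
        from_redfish = normalize_mac(redfish[host])
        from_puppetdb = normalize_mac(puppetdb[host])
        if from_redfish != from_puppetdb:
            mismatches.append((host, from_redfish, from_puppetdb))
    return mismatches
```
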

Jun 12 2025, 8:01 AM · Infrastructure-Foundations, netops, SRE-tools
Volans edited P77792 PXE NIC MAC address retrieval audit.
Jun 12 2025, 8:00 AM
Volans created P77792 PXE NIC MAC address retrieval audit.
Jun 12 2025, 7:58 AM
Volans triaged T396712: Evaluate automatic MAC-based DHCP for production servers as Medium priority.
Jun 12 2025, 7:57 AM · Infrastructure-Foundations, netops, SRE-tools
Volans added a comment to T244315: decommission cookbook: add support for decom spreadsheet.

@wiki_willy is this something still needed, or does the current workflow not need it anymore?

Jun 12 2025, 7:21 AM · Infrastructure-Foundations, SRE-tools
Volans moved T239392: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. from Backlog to Radar on the SRE-tools board.
Jun 12 2025, 7:19 AM · Infrastructure-Foundations, SRE, serviceops, SRE-tools, PyBal
Volans closed T206448: Decommission script race condition as Declined.

The script hasn't existed for a long time; it was replaced by the related cookbook.

Jun 12 2025, 7:18 AM · Infrastructure-Foundations, SRE, SRE-tools
Volans added a comment to T200306: Improve database master switchover script.

Is this still relevant or superseded by more recent development/plans in this area?

Jun 12 2025, 7:16 AM · Infrastructure-Foundations, SRE-tools, DBA
Volans moved T163365: Switchdc RO/RW: add check to test it editing a real wiki from Backlog to Radar on the SRE-tools board.
Jun 12 2025, 7:14 AM · Infrastructure-Foundations, serviceops, SRE-tools
Volans moved T395032: Cookbook sre.hosts.remove_downtime does not remove silences from Backlog to Radar on the SRE-tools board.
Jun 12 2025, 7:11 AM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, SRE-tools
Volans added a comment to T395032: Cookbook sre.hosts.remove_downtime does not remove silences.

Just to clarify expectations here, while SRE-tools is happy to be included in the discussion/design, we think that this request belongs to the specific cookbook owner (observability).

Jun 12 2025, 7:10 AM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, SRE-tools
Volans removed a project from T393692: transfer.py fails when handling nftables-configured firewall: SRE-tools.
Jun 12 2025, 7:07 AM · Patch-For-Review, database-backups, Infrastructure-Foundations
Volans moved T336485: Setup zero touch provisioning (ZTP) for network devices from In Progress to Backlog on the SRE-tools board.
Jun 12 2025, 7:06 AM · Patch-For-Review, SRE, Infrastructure-Foundations, netops, SRE-tools
Volans moved T319277: wait_for_optimal() should ignore acked alerts from In Progress to Backlog on the SRE-tools board.
Jun 12 2025, 7:06 AM · Infrastructure-Foundations, Spicerack, SRE-tools

Jun 11 2025

Volans added a comment to T394372: Migrate clouddb* hosts to MariaDB 10.11.

From MariaDB 10.7 according to https://mariadb.com/kb/en/reserved-words/ ;)

Jun 11 2025, 12:55 PM · cloud-services-team (FY2024/2025-Q3-Q4), Data-Services, Data-Persistence

Jun 9 2025

Volans added a project to T396319: Raid handler for broadcom disk didn't automatically open task on db2226: Observability-Alerting.

I lost my bet, forced the description to be with - in the config and the handler didn't create the task either, this is something more subtle, but the raid handler works fine when called manually.

Jun 9 2025, 7:30 AM · Observability-Alerting, Data-Persistence, Infrastructure-Foundations, SRE-tools
Volans added a comment to T396319: Raid handler for broadcom disk didn't automatically open task on db2226.

I've reproed the thing forcing the check to be OK and then letting icinga re-trigger it. The raid handler was triggered but nothing happened.
Nothing special in the logs of icinga and nothing in raid_handler.log.

Jun 9 2025, 7:13 AM · Observability-Alerting, Data-Persistence, Infrastructure-Foundations, SRE-tools
Volans added a comment to T396319: Raid handler for broadcom disk didn't automatically open task on db2226.

As a side note sudo /usr/lib/nagios/plugins/check_nrpe -4 -H db2226 -c get_raid_status_broadcom returns an exit status of 2 but it seems to report the controller status fine. The exit status of the get_* should be 0 in this case AFAIK, not the same as the check_*.

Jun 9 2025, 6:58 AM · Observability-Alerting, Data-Persistence, Infrastructure-Foundations, SRE-tools
Volans added a comment to T396319: Raid handler for broadcom disk didn't automatically open task on db2226.

From IRC logs the handler was triggered fine:

Jun 07 04:22:14 alert1002 icinga[1502926]: SERVICE EVENT HANDLER: db2226;Dell PowerEdge RAID / Supermicro Broadcom Controller;CRITICAL;HARD;3;raid_handler!broadcom!codfw
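
For anyone grepping for similar entries, a small sketch that splits such a line into its fields. The function name and the dictionary keys are illustrative; the six semicolon-separated fields are assumed to be host, service, state, state type, attempt and command, as the line above suggests.

```python
from typing import Dict, Optional

_MARKER = "SERVICE EVENT HANDLER: "
_KEYS = ("host", "service", "state", "state_type", "attempt", "command")


def parse_event_handler(line: str) -> Optional[Dict[str, str]]:
    """Parse a SERVICE EVENT HANDLER log line, or return None if it is not one."""
    _, found, payload = line.partition(_MARKER)
    if not found:
        return None
    # Limit the split so a command containing ';' stays in one piece.
    fields = payload.strip().split(";", 5)
    if len(fields) != len(_KEYS):
        return None
    return dict(zip(_KEYS, fields))
```
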
Jun 9 2025, 6:47 AM · Observability-Alerting, Data-Persistence, Infrastructure-Foundations, SRE-tools
Volans renamed T396319: Raid handler for broadcom disk didn't automatically open task on db2226 from Broken disk on db2226 to Raid handler for broadcom disk didn't automatically open task on db2226.
Jun 9 2025, 6:46 AM · Observability-Alerting, Data-Persistence, Infrastructure-Foundations, SRE-tools

Jun 6 2025

Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

Forgot to mention, this is what I used to upgrade just the SSD firmware:

Jun 6 2025, 8:56 AM · Infrastructure-Foundations, SRE-tools, DC-Ops
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

Thanks @RKemper for the depool, I've performed the final run with the current PS in gerrit with test-cookbook for cirrussearch2113. All good.

Jun 6 2025, 8:25 AM · Infrastructure-Foundations, SRE-tools, DC-Ops

Jun 5 2025

Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

@bking @RKemper I'm ready with the final test for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1150728
I see that cirrussearch2113 is pooled so I didn't proceed with it. LMK when a test host will be available, thanks in advance.

Jun 5 2025, 7:59 PM · Infrastructure-Foundations, SRE-tools, DC-Ops

Jun 4 2025

Volans removed a project from T395958: Cookie “WMF-Uniq” has been rejected because it is in a cross-site context: Infrastructure-Foundations.

Removing SRE I/F as we're not involved in the WMF-Uniq cookie management.

Jun 4 2025, 8:17 AM · Experimentation Lab, Regression, xLab, Traffic

Jun 3 2025

Volans added a comment to T393097: Frequent filter timeouts in superset UI.

@JAllemandou thanks for the change, weird that superset defaults to the whole dataset, sigh. From my tests I can surely confirm that the filters load much quicker.
Ideally each filter should depend on all others but that matrix might just be quite painful to do and would probably start making the queries complex because of too many conditions.
If the issue didn't happen again since the change I think we can call this resolved.

Jun 3 2025, 2:35 PM · Data-Platform-SRE (2025.05.24 - 2025.06.13), superset.wikimedia.org
Volans added a comment to T395555: Homer: stop using the 'section' macro in jinja templates.

No objection from my side. I can look if there are other alternative options in addition to those mentioned, but I'm not sure there is any.

Jun 3 2025, 9:15 AM · Infrastructure-Foundations, netops, SRE

May 29 2025

Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

@bking great, thanks a lot. I've already done cirrussearch2112 with my latest version of the patch. I'll do cirrussearch2113 on Monday with hopefully the final version. If you prefer to have cirrussearch2113 in the pool during the weekend feel free to do so and we can ban it again on Monday.

May 29 2025, 5:27 PM · Infrastructure-Foundations, SRE-tools, DC-Ops
Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

@bking as agreed on IRC let me know when another 1~2 hosts are ready for testing so we can complete the change for the cookbook and let everyone upgrade SSD firmware when needed.

May 29 2025, 4:15 PM · Infrastructure-Foundations, SRE-tools, DC-Ops
Volans added a project to T395553: ircecho (icinga-wm) doesn't automatically restart if not connected: Sustainability (Incident Followup).
May 29 2025, 10:46 AM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, Sustainability (Incident Followup)
Volans created T395553: ircecho (icinga-wm) doesn't automatically restart if not connected.
May 29 2025, 10:46 AM · SRE Observability (FY2025/2026-Q1), Observability-Alerting, Sustainability (Incident Followup)

May 28 2025

Volans added a comment to T393097: Frequent filter timeouts in superset UI.

For those that have investigated this at the start, can someone point me to some dashboard where I can verify the timeline to understand if this is related to any of the changes mentioned above?
Because if so we can adapt them, if not the cause might be elsewhere.

May 28 2025, 8:05 AM · Data-Platform-SRE (2025.05.24 - 2025.06.13), superset.wikimedia.org

May 27 2025

Volans added a comment to T392844: Q4:rack/setup/install apus-be1004.

Yes I think it makes sense to modify BiosNvmeDriver from DellQualifiedDrives to AllDrives when present at this point. Is it ok to leave it as all drives then or do we have any security concern?

May 27 2025, 1:39 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

May 26 2025

Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

Thanks Brian, I've upgraded the firmware of cirrussearch2111 with the above patch, it's all back to you.
The only thing that didn't work was the check of the job result because it was not there:

GET https://10.193.3.47/redfish/v1/TaskService/Tasks/JID_482912952574 returned HTTP 404
May 26 2025, 4:16 PM · Infrastructure-Foundations, SRE-tools, DC-Ops
Volans updated subscribers of T395172: Selena can't see objects in Netbox despite having wmf group membership.

The user created in Netbox has username sdeckelmann while the user in LDAP has UID sdeckelmann-wmf. As a result the user in Netbox didn't get the wmf group that grants the RO permissions.
I'm not sure how the current user was created in Netbox and marked as active (that should happen only if it belongs to cn=wmf,ou=groups,dc=wikimedia,dc=org). Was the LDAP entry first created with sdeckelmann and then modified to sdeckelmann-wmf?

May 26 2025, 6:15 AM · Infrastructure-Foundations, netbox, SRE, SRE-Access-Requests

May 22 2025

Volans added a comment to T394543: SSD firmware update not working in firmware cookbook.

@bking Yeah, no need to go offtopic for something almost 3y old. I have indeed forgotten about the Re: Request for NIC firmware update advice email thread, sorry. But unless I'm missing something I don't see in there any mention of a parallel separate approach on a gitlab repository not using cookbooks. And at the time the firmware cookbook was almost ready from what I gather from that email thread. I don't recall the details and possibly you've chatted with John about that more than me given that he was the one working on that project. Anyway, let's look at the future :)

May 22 2025, 4:36 PM · Infrastructure-Foundations, SRE-tools, DC-Ops