User Details
- User Since
- Feb 10 2016, 11:25 AM (506 w, 19 h)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF) [ Global Accounts ]
Tue, Oct 21
For some related historical context on the lack of parity between the Icinga and AM APIs in Spicerack see also T293209 (search for "optimal").
Fri, Oct 10
[note for future self] If we can wait for Python 3.14 to be around in our systems then we should evaluate also the new InterpreterPoolExecutor for this, it might be a good fit.
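To make the note concrete: `InterpreterPoolExecutor` (new in Python 3.14) implements the same `Executor` interface as the existing pools, so it should be a drop-in swap. A minimal sketch, using `ThreadPoolExecutor` so it runs on today's Python; the `check_host` helper and host names are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor  # InterpreterPoolExecutor in 3.14+

def check_host(host: str) -> str:
    # Placeholder for the per-host work we would parallelize.
    return f"{host}: ok"

hosts = ["cumin1002.eqiad.wmnet", "cumin1003.eqiad.wmnet"]

# With Python 3.14 available, swapping ThreadPoolExecutor for
# InterpreterPoolExecutor should be the only change needed, since both
# expose the same submit()/map() Executor API.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(check_host, hosts))
```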
Mon, Oct 6
You can use SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp [OPTIONS] cumin1002.eqiad.wmnet:/path/... /path/... from cumin1003 for example.
Wed, Sep 24
Transient failure of git pull for the cloud/wmcs-cookbooks repository, self-resolved at the next puppet run.
Tue, Sep 23
@Scott_French sorry for the trouble. The patch that added the timeout to the cookbook's version of the function landed ~1.5 years after the functionality landed in Spicerack, and somehow we missed double-checking functional equivalence when migrating the cookbook to the Spicerack module function. Sorry about that.
I think we can set a reasonable default timeout without the need to make it tunable, at least as a first fix that could go in at any time.
I doubt we'll have special needs for specific timeouts though; we're talking about DNS queries, not HTTP requests 😉
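A minimal sketch of the idea, with a hardcoded default rather than a tunable knob. The helper name and the 5-second value are assumptions for illustration; since `socket.getaddrinfo` has no timeout parameter of its own, the sketch bounds it with a worker thread:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

DEFAULT_DNS_TIMEOUT = 5.0  # assumed "reasonable default", not tunable per-call

def resolve(name: str, timeout: float = DEFAULT_DNS_TIMEOUT) -> list[str]:
    # getaddrinfo has no timeout argument, so run it in a worker thread and
    # bound the wait; a slow resolver raises TimeoutError instead of hanging.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(socket.getaddrinfo, name, None)
        infos = future.result(timeout=timeout)
    return sorted({info[4][0] for info in infos})

addresses = resolve("localhost")
```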
Sep 22 2025
Sep 11 2025
Ideally sampled logs would be good enough, depending on how complex the setup to sample them is.
If there are no easy options for real sampling we could also consider alternative approaches:
- poor man's sampling playing with log rotation and retention (e.g. rotate often and keep only 1 block every N rotated ones)
- a size-based retention that limits the total size of the logs to a predictable amount (it won't help with issues in the past, but I guess most issues where we need the data are live/ongoing)
- deduplicating the logs to increase the signal in the logs
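The poor man's sampling option can be sketched in a few lines: after each rotation, keep only every Nth rotated file. Everything here (the `app.log.N` suffix convention, the pruning helper name) is a hypothetical illustration, not how logrotate would actually be configured:

```python
import tempfile
from pathlib import Path

def prune_rotated(log_dir: Path, pattern: str = "app.log.*", keep_every: int = 5) -> None:
    # Keep one rotated file every `keep_every` rotations, delete the rest.
    # Assumes logrotate-style numeric suffixes: app.log.1, app.log.2, ...
    rotated = sorted(log_dir.glob(pattern), key=lambda p: int(p.suffix.lstrip(".")))
    for i, path in enumerate(rotated, start=1):
        if i % keep_every != 0:
            path.unlink()

# Demo on a throwaway directory with ten fake rotated logs.
demo = Path(tempfile.mkdtemp())
for i in range(1, 11):
    (demo / f"app.log.{i}").write_text("...")
prune_rotated(demo, keep_every=5)
remaining = sorted(p.name for p in demo.glob("app.log.*"))
```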
+1 for me
The problem is that there is no evidence in hardware logs and I doubt we'll get any replacement from Dell without them.
I've found this in kern.log/dmesg but nothing in racadm logs (both getsel and lclog):
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786152] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786154] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786155] {1}[Hardware Error]: event severity: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]: Error 0, type: corrected
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786156] {1}[Hardware Error]: fru_text: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]: section_type: memory error
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786157] {1}[Hardware Error]: error_status: 0x0000000000000400
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786158] {1}[Hardware Error]: physical_address: 0x0000000f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786159] {1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 2 bank: 18 device: 6 row: 53592 column: 448
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786160] {1}[Hardware Error]: error_type: 2, single-bit ECC
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786161] {1}[Hardware Error]: DIMM location: B1
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786168] [Firmware Warn]: GHES: Invalid error status block length!
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786185] soft_offline: 0xf9ec05: invalidated
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786192] mce: [Hardware Error]: Machine check events logged
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.786194] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 255: 940000000000009f
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.794205] mce: [Hardware Error]: TSC 0 ADDR f9ec05180
Sep 11 02:33:32 cloudcephosd1041 kernel: [125071.799695] mce: [Hardware Error]: PROCESSOR 0:806f8 TIME 1757558012 SOCKET 0 APIC 0 microcode 2b000639
Sep 10 2025
Restart completed
Both ssh and the VM console hang, without giving any prompt. Forcing a VM restart.
I'm looking into it
Sep 8 2025
This might have been related to the migration to puppet7 and probably the new puppetdb hosts; I can't recall. Resolving as it cannot be reproduced right now AFAICT, feel free to re-open if that's not the case.
It works fine for me:
>>> p.hiera_lookup('cumin1003.eqiad.wmnet', 'profile::puppet::agent::force_puppet7')
DRY-RUN: Executing commands ['puppet lookup --render-as s --compile --node cumin1003.eqiad.wmnet profile::puppet::agent::force_puppet7 2>/dev/null'] on 1 hosts: puppetserver1001.eqiad.wmnet
'true'
Aug 28 2025
+1 for me to upgrade them to bookworm for simplicity and to be in sync with the cumin hosts.
Jul 21 2025
Quick update: the cumin hosts are now on bookworm, where python3-kubernetes is on v22.6.0, but we still have cumin1002 around on bullseye until all the DBA stuff has been made compatible with bookworm, see T389380.
@Marostegui sorry I didn't understand that the old host was already unracked or otherwise unreachable also on the management side and I thought my earlier reply was already covering the questions, my bad.
Jul 16 2025
Jul 14 2025
@JMeybohm do you have a specific use case that cannot be solved (or is hard to solve) by simply changing the batch_size of the call to puppet.run()?
https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.run
If the new host has a new hostname I think the usual decom template can be used and followed. I don't see any blocker there, if the host is not up and running or the disks were removed the only thing skipped by the decom cookbook will be the disk wipe.
Jul 9 2025
Are we sure that the network card is properly installed? I'm getting this from racadm:
An immediate workaround was implemented in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1167593, which also gives us an idea of what could be useful to expose from spicerack: a plain KafkaAdminClient just pre-configured with the connection to the current cluster, or some more advanced wrapper to extract specific information.
As a starter, probably just exposing the KafkaAdminClient could be enough I guess, but I'll leave the decision to the kafka experts.
I wonder if the prometheus servers have a similar behavior of applying changes from puppet exported resources.
Jul 3 2025
Ack, let's do both: disable it in the bios and skip it in the import
Option 2 LGTM too
Totally agree there is no point. For the idrac one the only potential use case would be to match it with our existing mgmt but probably not.
Jul 2 2025
Jun 27 2025
There are currently changes to remove:
frpig2001.mgmt.frack.codfw.wmnet pay-lvs2001.mgmt.frack.codfw.wmnet pay-lvs2002.mgmt.frack.codfw.wmnet pay-lvs2001.frack.codfw.wmnet pay-lvs2002.frack.codfw.wmnet frban1001.mgmt.frack.eqiad.wmnet frpig2001.frack.codfw.wmnet frban1001.frack.eqiad.wmnet
and their related reverse PTR records.
Are those ok to be removed from the live DNS?
When editing netbox DNS records please always make sure to run the sre.dns.netbox cookbook as otherwise there are pending changes that will block other users and trigger icinga alerts.
Jun 26 2025
But I've tested the scp_dump that was failing and it's fixed, so I think the provision should work. Feel free to try it (from cumin2002, which has the latest version; I will update the others as soon as other testing on other changes is completed).
@Jhancock.wm thanks, I've run the provision cookbook on cp2044 but it is giving me an authentication credential error.
Jun 25 2025
By trial and error with Luca we found that the Target parameter wants a list now. Sent new fix.
No way, it doesn't work yet, but I need to understand why:
>>> import xml.etree.ElementTree as ET
>>> from xml.dom import minidom
>>> schema = ET.fromstring(r.request('get', '/redfish/v1/Schemas/OemManager_v1.xml').text)
>>> print(ET.tostring(schema).decode())
[...SNIP...]
<ns1:Property Name="Target" Type="Collection(Edm.String)" Nullable="false">
  <ns1:Annotation Term="OData.Permissions" EnumMember="OData.Permissions/Read" />
  <ns1:Annotation Term="OData.Description" String="To identify the component for Export. It identifies the one or more FQDDs. Default = ALL." />
  <ns1:Annotation Term="OData.LongDescription" String="This property shall indicate the component(s) for Export. The list of valid values for Target is ALL, IDRAC, BIOS, NIC, RAID, FC, InfiniBand, SupportAssist, EventFilters, System, LifecycleController, AHCI, PCIeSSD etc. FQDD strings are also supported as valid values. Default value is ALL. The action ImportSystemConfigurationPreview accepts only the default value ALL." />
</ns1:Property>
[...SNIP...]
Am I looking at the wrong schema?
Jun 24 2025
As a reminder to ourselves, to run django manage.py commands you need to:
$ sudo -i
$ export DJANGO_SETTINGS_MODULE=debmonitor.settings.prod
$ export DEBMONITOR_CONFIG=/etc/debmonitor/config.json
$ python3 /usr/lib/python3/dist-packages/debmonitor/manage.py
A possible alternate format, slightly more structured, to ensure we read the right keys instead of an almost free-form one:
{
"cluster": "clustername",
"images": [
{
"name": "docker-registry.discovery.wmnet/cert-manager/cainjector:1.10.1-2",
"namespaces": {
"cert-manager": 2,
....
}
}
]
}
This is from real code executed on localhost.
Would something like this be ok on the debmonitor side?
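To show the proposed format is easy to consume, a small sketch that parses it and aggregates per-image counts across namespaces; the field names ("cluster", "images", "namespaces") are the ones from the example above, the aggregation itself is just an illustration:

```python
import json

# Parse a payload in the proposed structure (same keys as the example above).
payload = json.loads("""
{
  "cluster": "clustername",
  "images": [
    {
      "name": "docker-registry.discovery.wmnet/cert-manager/cainjector:1.10.1-2",
      "namespaces": {"cert-manager": 2}
    }
  ]
}
""")

# Aggregate how many instances of each image run cluster-wide, summing
# the per-namespace counts.
totals = {
    image["name"]: sum(image["namespaces"].values())
    for image in payload["images"]
}
```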
Jun 23 2025
Those new servers are generation 17, the first one shipped with iDRAC 10 and a firmware version of 1.20.x.x.
Its Redfish support is slightly different from the previous iDRACs' and, according to [1], also not yet fully developed.
Looking at the manual [2] and the SCP specific reference guide [3] and based on a quick test on cp2043, I think that just adding the "ShareType" parameter should be enough.
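For illustration, a hedged sketch of what an SCP export request body with the extra parameter could look like, using the parameter names from Dell's SCP reference guide ("ExportFormat", "ShareParameters", "ShareType", "Target"); this is an assumption-laden example, not the tested payload:

```python
import json

# Hypothetical ExportSystemConfiguration body for the iDRAC SCP Redfish
# action. "ShareType": "Local" is the extra parameter discussed above;
# "Target" is passed as a list since the schema declares it as a
# Collection(Edm.String).
body = {
    "ExportFormat": "JSON",
    "ShareParameters": {
        "ShareType": "Local",
        "Target": ["ALL"],
    },
}
request_body = json.dumps(body)
```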
Jun 19 2025
It's present twice in the dump:
$ zgrep wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte /srv/psql-all-dbs-latest.sql.gz
-- Name: wireless_wirelesslink wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte; Type: FK CONSTRAINT; Schema: public; Owner: netbox
    ADD CONSTRAINT wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte FOREIGN KEY (interface_a_id) REFERENCES public.dcim_interface(id) DEFERRABLE INITIALLY DEFERRED;
-- Name: wireless_wirelesslink wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte; Type: FK CONSTRAINT; Schema: public; Owner: netbox
    ADD CONSTRAINT wireless_wirelesslin_interface_a_id_bc9e37fd_fk_dcim_inte FOREIGN KEY (interface_a_id) REFERENCES public.dcim_interface(id) DEFERRABLE INITIALLY DEFERRED;
Jun 18 2025
I had to deploy homer to cumin2002 after the upgrade:
sudo cookbook sre.deploy.python-code -r 'Release v0.10.1' homer 'cumin2002*'
Now it works fine.
yes if you pick the same version (option 0 above) it would just tell you that there is nothing to do because already at the same version. Thanks for checking.
The cookbook exited with that code because it had a failure, unfortunately was missing a useful logging message at the right point. I'm adding it in this patch.
Manually checking the disks, they both show the correct version, so I'm not 100% sure what happened there; it could be that they were not (yet?) reporting the new version. If you try to re-run it, it does tell you there is nothing to upgrade, right?
That's an interesting idea that would work right now because the auto-download from the Dell website is broken, but if we fix that then any cumin host where the cookbook runs will download a new file if asked. I guess we need to decide if we want to work on resurrecting that workflow or not (with all the caveats it has).
Created T397306
@BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in /srv/firmware/poweredge-r440/SSD. If you use that path it should just work.
Jun 16 2025
Why is reimaging messing with those addresses at all? @Volans says that it's because of syncing with puppetdb, but I don't see evidence that those addresses were ever referenced in puppet.
Sounds good! Resolving this, happy to discuss further improvements whenever you want.
Jun 12 2025
Changes applied, I had to also run the sudo cookbook sre.dns.wipe-cache db2241.mgmt.codfw.wmnet db2242.mgmt.codfw.wmnet to make sure I was connecting to the same one.
Thanks, I'm checking with @wiki_willy too for the accounting side before proceeding to be sure.
Running the provision cookbook (with the appropriate options for an existing host) might or might not trigger a host reboot, based on which configurations are changed. So it might depend on the current status of the host and how much it differs from what the provision cookbook expects.
I think this has been a case of serial number swap, and it probably "worked" because both hosts were set up at the same time, maybe.
FYI db2241 and db2242 have their MGMT DNS inverted, so db2241.mgmt.codfw.wmnet points to db2242 iDRAC and viceversa.
This is very dangerous as operating on one host will actually perform changes on the other, like a reboot, and it happens that db2241 is a master right now.
I've run some custom code with spicerack-shell to get the audit data for the whole fleet, comparing the MAC address retrieved from Redfish with the one in PuppetDB for the primary interface.
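The comparison itself boils down to something like the sketch below. The host names, MAC values, and helper name are all made up for illustration; only the normalize-then-diff idea reflects the audit described above:

```python
def mac_mismatches(redfish: dict[str, str], puppetdb: dict[str, str]) -> list[str]:
    # Normalize case and separators so "AA-BB-.." and "aa:bb:.." compare equal,
    # then report hosts where the two sources disagree.
    def norm(mac: str) -> str:
        return mac.lower().replace("-", ":")

    return sorted(
        host
        for host in redfish.keys() & puppetdb.keys()
        if norm(redfish[host]) != norm(puppetdb[host])
    )

# Hypothetical data: db2241's Redfish MAC doesn't match what PuppetDB has.
mismatches = mac_mismatches(
    {"db2241": "AA:BB:CC:DD:EE:01", "db2242": "AA:BB:CC:DD:EE:02"},
    {"db2241": "aa:bb:cc:dd:ee:02", "db2242": "aa:bb:cc:dd:ee:02"},
)
```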
@wiki_willy is this something still needed or the current workflow doesn't need it anymore?
The script hasn't existed for a long time; it was replaced by the related cookbook.
Is this still relevant or superseded by more recent development/plans in this area?
Just to clarify expectations here, while SRE-tools is happy to be included in the discussion/design, we think that this request belongs to the specific cookbook owner (observability).
Jun 11 2025
From MariaDB 10.7 according to https://mariadb.com/kb/en/reserved-words/ ;)
Jun 9 2025
I lost my bet: I forced the description to be with - in the config and the handler didn't create the task either, so this is something more subtle. The raid handler works fine when called manually, though.
I've reproduced the issue by forcing the check to be OK and then letting icinga re-trigger it. The raid handler was triggered but nothing happened.
Nothing special in the logs of icinga and nothing in raid_handler.log.
As a side note, sudo /usr/lib/nagios/plugins/check_nrpe -4 -H db2226 -c get_raid_status_broadcom returns an exit status of 2 but seems to report the controller status fine. The exit status of the get_* should be 0 in this case AFAIK, not the same as the check_*'s.
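The expected split between the two command families can be sketched as follows; the function names, the "Optimal" string match, and the mapping are illustrative, not the real plugin code:

```python
# Nagios plugin exit-code convention.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def get_raid_status() -> tuple[str, int]:
    # A get_* command just prints the raw controller status; its exit status
    # should be 0 as long as the query itself succeeded.
    return "Controller status: Optimal", 0

def check_raid_status(raw_status: str) -> int:
    # The check_* command is the one that carries the Nagios state,
    # mapping the parsed status to 0/1/2/3.
    state = "OK" if "Optimal" in raw_status else "CRITICAL"
    return NAGIOS_CODES[state]

raw_status, get_exit_code = get_raid_status()
check_exit_code = check_raid_status(raw_status)
```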
From IRC logs the handler was triggered fine:
Jun 07 04:22:14 alert1002 icinga[1502926]: SERVICE EVENT HANDLER: db2226;Dell PowerEdge RAID / Supermicro Broadcom Controller;CRITICAL;HARD;3;raid_handler!broadcom!codfw
Jun 6 2025
Forgot to mention, this is what I used to upgrade just the SSD firmware:
Thanks @RKemper for the depool, I've performed the final run with the current PS in gerrit with test-cookbook for cirrussearch2113. All good.
Jun 5 2025
@bking @RKemper I'm ready with the final test for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1150728
I see that cirrussearch2113 is pooled so I didn't proceed with it. LMK when a test host will be available, thanks in advance.
Jun 4 2025
Removing SRE I/F as we're not involved in the WMF-Uniq cookie management.
Jun 3 2025
@JAllemandou thanks for the change; weird that superset defaults to the whole dataset, sigh. From my tests I can surely confirm that the filters load much quicker.
Ideally each filter should depend on all the others, but that matrix might just be quite painful to build and would probably start making the queries complex due to too many conditions.
If the issue didn't happen again since the change I think we can call this resolved.
No objection from my side. I can look if there are other alternative options in addition to those mentioned, but I'm not sure there is any.
May 29 2025
@bking great, thanks a lot. I've already done cirrussearch2112 with my latest version of the patch. I'll do cirrussearch2113 on monday with hopefully the final version. If you prefer to have cirrussearch2113 in the pool during the weekend feel free to do so and we can ban it again on monday.
@bking as agreed on IRC let me know when another 1~2 hosts are ready for testing so we can complete the change for the cookbook and let everyone upgrade SSDs firmware when needed.
May 28 2025
For those who investigated this at the start: can someone point me to a dashboard where I can verify the timeline, to understand if this is related to any of the changes mentioned above?
Because if so we can adapt them, if not the cause might be elsewhere.
May 27 2025
Yes, I think it makes sense to modify BiosNvmeDriver from DellQualifiedDrives to AllDrives when present at this point. Is it ok to leave it as AllDrives then, or do we have any security concern?
May 26 2025
Thanks Brian, I've upgraded the firmware of cirrussearch2111 with the above patch; it's all back to you.
The only thing that didn't work was the check of the job result because it was not there:
GET https://10.193.3.47/redfish/v1/TaskService/Tasks/JID_482912952574 returned HTTP 404
The user created in Netbox has username sdeckelmann while the user in LDAP has UID sdeckelmann-wmf. As a result the user in Netbox didn't get the wmf group that grants the RO permissions.
I'm not sure how the current user was created in Netbox and marked as active (that should happen only if it belongs to cn=wmf,ou=groups,dc=wikimedia,dc=org). Was the LDAP entry first created with sdeckelmann and then modified to sdeckelmann-wmf?
May 22 2025
@bking Yeah, no need to go offtopic for something almost 3y old. I had indeed forgotten about the Re: Request for NIC firmware update advice email thread, sorry. But unless I'm missing something, I don't see in there any mention of a parallel separate approach on a gitlab repository not using cookbooks, and at the time the firmware cookbook was almost ready, from what I gather from that email thread. I don't recall the details, and possibly you've chatted with John about that more than me given that he was the one working on that project. Anyway, let's look at the future :)

