User Details
- User Since
- Nov 2 2020, 11:59 AM (250 w, 4 d)
- Availability
- Available
- IRC Nick
- dcaro
- LDAP User
- David Caro
- MediaWiki User
DCaro (WMF)
Yesterday
A simpler option is to do the queueing on the components-api side; that's probably easier right now too (and does not prevent the other solutions). I'll create a subtask for that.
After a live discussion, we agreed to the above with the following minor changes:
I think we can try updating the 'latest-versions' builder image to add that support; I still have to run a full battery of tests and such.
If you have a reproducer that would be great; if not and it happens again, leaving it in place for us to inspect would also help. Otherwise something like kubectl get deployment/itwiki-draftbot-continuous -o yaml, kubectl describe deployment/itwiki-draftbot-continuous and kubectl get events -o yaml might be helpful for post-debugging.
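A minimal sketch of collecting those into files before cleaning things up, so they can be inspected later (the deployment name is the one from this example; exact resource names may differ):

# capture the current state for post-debugging
kubectl get deployment/itwiki-draftbot-continuous -o yaml > deployment.yaml
kubectl get pods -o yaml > pods.yaml
kubectl describe deployment/itwiki-draftbot-continuous > deployment-describe.txt
kubectl get events -o yaml > events.yaml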
Note that the osd was actually added and it's getting data in, but it did not clear the osd flags.
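If it's the usual cluster-wide flags that were left behind (an assumption, I haven't checked which ones the cookbook sets), something like this would show and clear them:

ceph osd dump | grep flags    # see which cluster-wide flags are still set
ceph osd unset noout          # clear whichever are listed; these two names are just examples
ceph osd unset norebalance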
Manually zapping /dev/sdb on cloudcephosd1004, as the depool_and_destroy cookbook did not do it (see T402515: [cookbook,ceph] depool_and_destroy ceph cookbook failed to destroy a single osd):
root@cloudcephosd1004:~# ls -la /var/lib/ceph/osd/ceph-66/block
lrwxrwxrwx 1 ceph ceph 93 Aug 21 03:30 /var/lib/ceph/osd/ceph-66/block -> /dev/ceph-62e49003-b3e0-4ecb-acbc-b82348164434/osd-block-06f40a8e-5b3c-4478-af57-739e819bddee
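For reference, the manual zap itself is roughly this, assuming the standard ceph-volume tooling and that osd.66 has already been removed from the cluster:

systemctl stop ceph-osd@66
ceph-volume lvm zap /dev/sdb --destroy    # wipes the LVs/VG shown above and the device header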
ceph-osd@69 came up ok too; only 66 is left down.
ceph-osd@68 came up ok
ceph-osd@67 came up ok
Hmm.... before crashing, it starts checking old peers:
Aug 21 09:55:36 cloudcephosd1004 ceph-osd[173450]: 2025-08-21T09:55:36.717+0000 7fcf28c85700 1 osd.66 pg_epoch: 72700590 pg[8.86b( v 72700589'93196338 (72700063'93192576,72700589'93196338] local-lis/les=72700588/72700589 n=4992 ec=25427536/271059 lis/c=72700588/72700588 les/c/f=72700589/72700589/0 sis=72700590) [66,161,319]/[66,319,117] r=0 lpr=72700590 pi=[72700588,72700590)/1 crt=72700589'93196338 lcod 0'0 mlcod 0'0 remapped mbc={}] start_peering_interval up [66,319] -> [66,161,319], acting [66,319,117] -> [66,319,117], acting_primary 66 -> 66, up_primary 66 -> 66, role 0 -> 0, features acting 4540138312169291775 upacting 4540138290693341183
Still failing; the time it takes before dying:
Aug 21 09:54:26 cloudcephosd1004 systemd[1]: ceph-osd@66.service: Consumed 57.214s CPU time.
I have tried extending the systemd unit start timeout to 5 min to see if that helps, though I think it's not even getting to the 1m30s default :fingerscrossed:
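For the record, the bump is roughly this drop-in (a sketch of what I applied, assuming the stock ceph-osd@.service unit):

mkdir -p /etc/systemd/system/ceph-osd@66.service.d
cat > /etc/systemd/system/ceph-osd@66.service.d/override.conf <<'EOF'
[Service]
TimeoutStartSec=5min
EOF
systemctl daemon-reload
systemctl start ceph-osd@66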
hmm... I wonder if it has some cached record of cloudcephosd1042 from the old v14 version, and when running check_prior_readable_down_osds it finds that old version and crashes? (and at some point that cache gets cleared and then it starts)
The first try to start ceph-osd@66 failed with the same error:
Aug 21 09:44:47 cloudcephosd1004 ceph-osd[168404]: ceph-osd: ./src/osd/PeeringState.cc:1255: bool PeeringState::check_prior_readable_down_osds(const OSDMapRef&): Assertion `HAVE_FEATURE(upacting_features, SERVER_OCTOPUS)' failed.
Insisting a bit on starting ceph-osd@65 seemed to get it up and running; maybe there's some "start timeout"?
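By "insisting" I mean roughly this kind of retry loop (a sketch, not literally what I ran):

# keep trying until systemd reports the osd active
until systemctl is-active --quiet ceph-osd@65; do
    systemctl start ceph-osd@65 || true
    sleep 30
done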
The lost pings are towards 1004, and the current sources of the lost pings are cloudcephosd1043/44/47; those are not yet in the cluster, so that should not be an issue.
Looking at the Grafana dashboards, I noticed that there's a relatively high loss of jumbo frames:
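To double-check that outside of the dashboards, a manual jumbo-frame probe between the hosts would be something like this (assuming a 9000-byte MTU on the cluster network):

# 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation
ping -M do -s 8972 -c 20 cloudcephosd1004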
Wed, Aug 20
It will depend on the buildpack itself: some buildpacks allow installing a wide range of versions for the language of choice, some only support a small range, and some (dotnet iirc) let you try to install any available version and only fail later if there are incompatibilities.
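For example, with the Heroku Python buildpack (if I remember the convention right) the requested version is just a file in the repo, and the build fails if the buildpack can't provide it; the file name and version format vary per buildpack:

# hypothetical example for the heroku/python buildpack
echo "python-3.11" > runtime.txt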
I wasn't advocating that we implement the protocol; that'd be unwise. I was just questioning the underlying assumption: do our users really need direct access to S3 buckets, or just a place to put stuff that is not in NFS?
What am I missing? Why is this a bad approach? The upside is that all the problems of managing multiple auth tokens go away; we just do things the same way we currently do in Toolforge.
Tue, Aug 19
This is deployed already, feel free to reopen if you still see the issue.
Not all projects are currently replicated; on tools-harbor-2:
Mon, Aug 18
This is weird, as both calls are to the same exact endpoint, so it's not likely a change in behavior between calls.
This might also be alleviated by having a caching proxy of sorts to avoid always hitting external services (that would also speed up some processes).
Talking out loud a bit here :)
I don't see anything on https://wikitech.wikimedia.org/wiki/Help:Toolforge/API about OAuth authentication.
Fri, Aug 15
Yep, that is the external endpoint, for which certificate-based auth is not allowed; the other is internal, for which it works. If you were using token auth, or hitting a non-authed endpoint, it would work. We should probably add that to the spec though, if it's not there (haven't checked).
Thu, Aug 14
Two ideas come to me right away:
fyi. The config schema for tool configuration was created in T397724: [components-api] Provide a standalone version of tool config schema
I'm actually not sure what would happen if I set different refs for the same repo here... I assume the last build wins and updates the latest tag (it doesn't matter in my use case, but that could be unexpectedly interesting).
On the NFS side, I checked the dbus ids (/var/lib/dbus/machine-id) and they are all different, and the nfs-client ids are empty, so it should be using the default ("Linux NFS " + hostname). If we need to change that, this explains it a bit: https://docs.kernel.org/filesystems/nfs/client-identifier.html#selecting-an-appropriate-client-identifier
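For reference, roughly what I looked at on each host (the module parameter path is from memory, per the linked doc):

cat /var/lib/dbus/machine-id                    # dbus machine id, different on every host
cat /sys/module/nfs/parameters/nfs4_unique_id   # explicit nfs client id, empty here, so the default applies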
Wed, Aug 13
Unsure what you meant by system logs in this context. Is the plan to do this in logs-api instead? If that's the case, I'd rather begin with that instead of doing it on jobs-api and having everything discarded later.
@Raymond_Ndibe the list of statuses here is just a proposal, to be discussed/refined, so that's the first part of the task.