[09:10:26] morning!
[09:10:31] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/67
[09:18:58] done
[09:21:20] thanks!
[09:38:16] there is this alert
[09:38:19] https://usercontent.irccloud-cdn.com/file/yQrAKWzu/image.png
[09:38:26] which seems like a puppet problem somewhere?
[09:39:45] slyngs: I think this is related to this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085515
[09:40:05] can we safely ignore?
[09:41:01] Yes, we are removing the old conntrack alerting
[09:41:56] ok
[09:42:19] do we get a similar alert based on prometheus instead of icinga? do we need to configure something?
[09:42:36] No, same alert, just from Prometheus
[09:42:49] great, thanks
[09:43:45] It's defined here instead: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/netfilter.yaml
[09:45:24] slyngs: but if the alerts are team-sre, will we get them with a label `team=wmcs`?
[09:46:08] Ah... hmm
[09:49:30] btw, I'm surprised the yaml merge `<<:` is working. Last time I checked, prometheus config did not support that, see https://github.com/prometheus/prometheus/issues/2347#issuecomment-1867689333
[09:50:13] It also doesn't seem to work when I test locally
[09:50:58] it's an issue with the yaml standard iirc, and some libs tend not to implement it, some do, but it's not a must
[09:52:02] I think it never left the draft stage https://yaml.org/type/merge.html
[09:52:29] Hmm, I might need to revert the removal of the Icinga check, because I don't think there's a way to get AlertManager to route the alerts to WMCS as is. It's kinda all or nothing. That is an issue we had with other alerts, but I don't recall if there was a fix
[09:55:33] dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087397 <- I'll talk to o11y and see if there's a way to route the alerts correctly, so that WMCS gets the alerts from the cloud hosts and the rest go to SRE.
[09:55:57] slyngs: thanks!
[10:02:20] dcaro: Sorry about the noise
[10:02:49] np, thanks for moving us out of icinga :)
[10:20:43] heads up, I will be testing stuff in codfw1dev, potentially making the APIs unstable
[10:20:53] (similar to yesterday)
[10:28:58] ack
[10:32:51] dcaro: is there a reason toolforge-deploy uses the python 3.9 CI image instead of 3.11?
[10:34:32] blancadesal: I don't think so, it does run on bastions and k8s control nodes though, maybe it comes from there, looking
[10:35:30] they all have 3.11, maybe we forgot to change once the VMs were upgraded to bookworm
[10:39:59] in the case of toolforge-deploy, we are not using pre-commit from tox but directly from the CI container, and it has pre-commit==3.0.1 which also causes issues now
[10:40:09] so I'd like to move it to 3.11
[10:40:33] +1 from me
[10:53:19] mr: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/578
[10:56:04] LGTM
[10:56:24] small usability improvement for lima-kilo https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/208
[10:58:33] nice!
[11:07:22] * dcaro lunch
[11:07:32] how should I go about getting the logs of a bucket using either radosgw-admin or wmcs-openstack? are the logs enabled? is this possible? anyone?
[11:07:49] explored this the other day with dcaro but we didn't go far enough
[11:09:07] Raymond_Ndibe: what are you looking for?
[11:09:20] dcaro: I forgot to mention this in our meeting yesterday, but the harbor_tests bucket seems to be working. I can push and even
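A quick illustration of the `<<:` merge-key point from [09:49:30]-[09:52:02]: the merge key never left draft status in the YAML spec, so support depends entirely on the parsing library. Below is a minimal PyYAML sketch of the behaviour; the keys and values are made up for illustration, not taken from the real netfilter.yaml:

```python
import yaml  # PyYAML is one of the libraries that implements the draft merge key

doc = """
defaults: &defaults        # anchor holding shared fields
  team: sre
  severity: warning

cloud_alert:
  <<: *defaults            # merge key: copy everything from the anchored mapping
  team: wmcs               # keys defined locally override merged ones
"""

data = yaml.safe_load(doc)
print(data["cloud_alert"]["team"])      # -> wmcs (local key wins)
print(data["cloud_alert"]["severity"])  # -> warning (inherited via <<)
```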
[11:11:04] Raymond_Ndibe: awesome :)
[11:11:20] even replicate to harbor_tests, so the issue appears to be from buckets in the region https://object.eqiad1.wikimediacloud.org
[11:11:42] for the logs, there are several sources, are you interested in any specific layer? (openstack, ceph, rados, proxies...)
[11:12:22] arturo: I am trying to find out why buckets in the region https://object.eqiad1.wikimediacloud.org cannot be used for harbor storage while those in the region https://object.codfw1dev.wikimediacloud.org can be
[11:12:24] you have https://logstash.wikimedia.org/app/home, this has openstack and ceph logs (using the openstack dashboard, deselect the openstack services and filter by cloudceph nodes)
[11:13:09] dcaro: tbh I want everything. Every possible place that logs might live
[11:13:33] logstash is the best place to start imo, as it aggregates all the nodes and services
[11:13:41] yeah
[11:13:53] also, debug level logs are not usually enabled in either ceph or openstack
[11:14:21] Raymond_Ndibe: I don't understand what you mean by "cannot be used for harbor storage", is there a failure?
[11:14:22] I tried playing with logstash a bit but I had no idea what I was looking at or if I was looking at the logs I want. But maybe let me look again
[11:14:24] then you can try going to every cloudcontrol (or a specific one if you pinpoint it in logstash), and there you have journalctl, which I think has some things
[11:15:55] arturo: the buckets just don't work. After connecting them to harbor, I keep getting this weird 500 unknown error for every operation. Also, this happens without connecting to harbor
[11:16:38] for example, trying to use s3cmd to push to any such bucket fails with a weird error
[11:17:27] even something as simple as `s3cmd info` retries a number of times with a 500 unknown error before finally succeeding
[11:17:44] so not particularly related to harbor, it's from the buckets themselves
[11:17:49] weird!
[11:18:19] it may be related to this error, Raymond_Ndibe: T360626
[11:18:20] T360626: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626
[11:18:37] had issues with that for so long I thought I was doing something wrong, until David created a bucket on testlabs in the region https://object.codfw1dev.wikimediacloud.org and that worked without a glitch
[11:19:02] oh, I remember looking a bit, might be related to fernet issues?
[11:19:03] Was it my imagination that we had a redis option running in openstack, or am I looking for it in the wrong place?
[11:19:45] Rook: might have been disabled from trove (I'm guessing you mean as a DB in trove), not sure what the last status there was
[11:20:08] Ok, I'll stop looking for it then. Thanks!
[11:20:14] arturo: yes, I think that's the exact error. Nice to see someone already noticed it in the past
[11:21:11] Raymond_Ndibe: reasons for the error are unknown at the moment. The only known solution at the moment is to retry
[11:22:56] * dcaro gtg
[11:23:39] That is bad. Will keep investigating. Need it working before we can use object storage for harbor
[11:24:54] maybe we can start by looking at the buckets that work and the ones that don't
[11:25:20] I suspect this is not related to particular buckets, but to the API machinery itself
[11:25:31] it may be haproxy, radosgw, etc.
[11:25:36] arturo: also any idea how to increase the size of a bucket?
[11:26:31] yes, I suspect the same too, which is the reason it's not exactly easy to pinpoint. But comparing the infra backing the buckets might help
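To illustrate the "retry" workaround mentioned at [11:21:11] for the intermittent radosgw 500s (T360626): a rough Python sketch that lets botocore retry transient 5xx responses itself, which is roughly what s3cmd ends up doing when it eventually succeeds. The endpoint is the eqiad1 radosgw from the discussion; the credentials and bucket name are placeholders, and this is not the actual harbor storage configuration:

```python
# Rough sketch only, not the real harbor config: placeholder credentials/bucket.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.eqiad1.wikimediacloud.org",  # radosgw S3 endpoint
    aws_access_key_id="PLACEHOLDER_ACCESS_KEY",
    aws_secret_access_key="PLACEHOLDER_SECRET_KEY",
    # "standard" mode makes botocore retry transient 5xx responses with backoff
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)

# Same idea as rerunning `s3cmd info`/`s3cmd put` until a 2xx finally comes back.
response = s3.head_bucket(Bucket="placeholder-bucket")
print(response["ResponseMetadata"]["HTTPStatusCode"])
```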
[11:28:00] Raymond_Ndibe: what do you mean by the size of the bucket? the quota?
[11:29:30] yes, the quota. I think the default is around 8-9 GiB
[11:29:39] yes, 8G is the default
[11:29:57] this is documented here https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide
[11:31:19] ok thanks
[11:35:40] Raymond_Ndibe: last time we debugged it we got to the point where we were seeing fernet token decryption errors on all but one cloudcontrol (in logstash), did you continue exploring that?
[11:38:22] essentially https://logstash.wikimedia.org/goto/912e938cb8583c72736cf3b86ff00de3
[11:38:31] dcaro: are you having issues sshing into toolsbeta-harbor-1.toolsbeta.eqiad1.wikimedia.cloud or is it just me?
[11:39:07] Raymond_Ndibe: can't log in as my user, only root
[11:39:19] probably sssd down (sometimes it stops working after OOMing)
[11:40:08] dcaro: did we look at logstash? we mostly looked at the servers together playing with radosgw-admin and wmcs-openstack
[11:40:43] Raymond_Ndibe: yep we did
[11:41:36] I didn't look at that again. I just backed up the commands we were playing with on the servers. Maybe that's a good place to continue from
[11:41:47] hmm, for toolsbeta it's not sssd, it seems it's not getting replies from ldap
[11:42:13] Raymond_Ndibe: yep, can you update the ticket with that info?
[11:42:20] gtg, be back in a bit
[11:42:42] We should increase the size of the toolsbeta-harbor server. The storage is like 20GB and harbor is already consuming close to that
[11:42:52] ok I'll do that
[13:07:31] there are bugfix updates for mod_oidc, is there anything to consider when updating it? the update would trigger an apache reload, but it should be brief
[13:19:39] moritzm: is that being used in the IDP setup?
[13:26:49] well, indirectly, since the OIDC protocol ultimately ends up on the IDPs, but on cloudcontrol it is used by some uwsgi keystone proxy
[13:26:57] but not sure what this does exactly?
[13:27:36] I'm not sure either, but you may safely update it. Keystone can generally survive an apache reload
[13:32:10] ok, thanks! I'll do that in a few minutes
[13:56:36] these are done now
[13:58:49] thanks
[15:04:08] dcaro: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/64
[15:06:39] blancadesal: they all show empty to me :/?
[15:07:31] oh, because of https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/.gitattributes?ref_type=heads#L2
[15:07:38] I'll have to checkout to review
[15:08:21] locally they look fine
[15:08:25] (I think?)
[15:08:34] * blancadesal afk for a while
[15:09:36] ohhh,
[15:10:01] I think it might be related to the cert dates being hardcoded in the recording, and then the code might try to renew stuff and such
[15:10:24] we might want to use freezegun or similar
[16:16:51] yeah, that makes sense, especially since the cert lifetime was reduced to 10 days, from a year
[16:25:30] it seems I will make it to the toolforge monthly meeting at least partially
[17:54:25] /usr/local/bin/wmcs-dnsleaks is failing with "Could not find project: tf-infra-test"
[17:56:24] I think that project was removed? (I saw a task flying by or something)
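Going back to the maintain-kubeusers recording issue at [15:10:01]-[15:10:24]: a minimal sketch of the freezegun idea, assuming the test failures come from replaying cassettes whose certificates carry hardcoded, now-expired dates. The frozen date and test name below are invented for illustration:

```python
from datetime import datetime, timezone
from freezegun import freeze_time

# Placeholder date: pick a moment inside the recorded certificates' validity window.
@freeze_time("2024-11-01T12:00:00Z")
def test_replayed_certs_are_not_renewed():
    # Inside the frozen block, datetime.now() returns the pinned time, so any
    # expiry check against the hardcoded cassette dates stays deterministic.
    assert datetime.now(timezone.utc).year == 2024
```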
[17:56:34] gtg
[17:56:36] * dcaro off
[17:57:03] yes the project was replaced by the newer "tofuinfratest"
[17:57:14] but maybe some traces were left
[17:57:31] I'm trying to figure out why the script is failing
[18:02:04] it looks like there is something still associated to that deleted project
[18:15:05] left some notes at T379076
[18:15:06] T379076: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076
[18:16:41] from what I can see this is not causing other issues apart from that dnsleaks script failing, which is not critical
[18:24:46] * dhinus off for today