[10:32:59] easy review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/222
[10:42:05] dcaro: +1d
[10:43:16] thanks!
[11:30:11] this one should help prevent the ephemeral allocation issues happening lately in toolforge: https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/25
[11:41:46] * dcaro lunch
[12:36:15] I'll be moving/rebooting VMs off and on today; starting with the toolsbeta nfs server so I'd expect nfs k8s workers to protest
[12:47:46] andrewbogott: ack
[12:50:45] hmm... is there any reason why we don't have cron installed on our VMs? (specifically in tools k8s workers, though maybe none of them)
[12:50:50] we have it on bare metals
[12:51:34] I don't think so, other than trying to move everything to systemd
[12:54:43] puppet is not pulling it on the bare metals (I think), so maybe it comes there by default :/
[12:57:04] so it's probably present in the baremetal base image but not in the cloud debian images
[13:00:50] I found it in some vms where it was installed automatically (not manually), but to be sure https://gerrit.wikimedia.org/r/1113128
[15:18:40] thanks!
[15:21:59] hmm, I'm thinking that it might be related to the way crons were handled in the grid
[15:29:01] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/25 (alert for the workers getting out of space)
[16:31:32] Interesting non-usual idea on IaC https://itnext.io/the-12-anti-factors-of-infrastructure-as-code-acb52fba3ae0
[17:42:13] a couple easy reviews https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/25
[17:42:28] https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/26
[17:42:36] that one is failing CI
[17:44:00] yep, just saw, fixing
[17:45:45] I'll add an extra note about `ALL`/`ci_only` tox entrypoints
[17:47:23] now it passed ci :)
[17:56:43] hmm, node exporter reports no metrics for `node_process_state` if there are no processes in a given state, that makes it a bit weird in the graphs as the last known value is the last != 0 value
[17:57:47] tools-acme-chief is reporting 0 though, so maybe there's a fix in a different version
[17:59:25] oh, no it does not :/, though I'm seeing 0s there
[17:59:27] weird
[18:17:07] * dcaro off
[18:17:11] cya tomorrow!
[18:26:25] Raymond_Ndibe: if you have some time, a review on https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/11 will help me tomorrow to get some other stuff in
[18:40:02] andrewbogott: If you are still interested in data on network failures, I have rolled out some updates for my gitlab-account-approval bot that include better logging and automated HTTP retries. I can dump some data somewhere for you and/or show you how to get it directly from the tool's logs.
[18:40:11] The logs have things like `2025-01-21T01:30:34Z urllib3.connectionpool WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable')': /r/a/groups/Trusted-Contributors/members?recursive`
[18:40:29] I am discouraged but interested!
[18:41:15] I will dump this file I have into a Phab paste and also put some notes there on how to reproduce the report.
[18:41:28] thank you!
[18:45:52] andrewbogott: https://phabricator.wikimedia.org/P72205
[18:48:52] does it ever happen for anything other than gerrit?
[18:50:26] Yesterday Gerrit seemed to be the hot spot. A few days before it was Phabricator. I think the Gerrit problems in the last 24 hours may have lined up with other external traffic pressure on Gerrit itself.
[18:51:35] "Failed to establish a new connection: [Errno 101] Network is unreachable'" looks to me like a local network failure but could it be an unreachable gerrit?
[18:52:43] Yeah, I'm not 100% sure, but I think that for instance a gerrit reboot could lead to that failure message.
[18:53:43] hrm i recall hearing about some sort of network-level rate limiting for gerrit, i wonder if that's kicking in
[18:54:31] the only gerrit dash I see is only about patch traffic
[18:54:34] not helpful
[18:55:19] taavi: that might be possible, yeah. the collab services folks do have some local firewall stuff that is trying to protect gerrit from rogue crawler bots
[18:56:18] gerrit is not behind the CDN's fancier protection rules so they built something more local
[19:00:03] * bd808 spots a bug in his new reporting from that dump
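
Note on the retry warnings quoted at 18:40:11: one way to tally lines like that by errno and request path is sketched below. This is only an illustration, not the report in the Phab paste. The regular expression is an assumption derived from the single sample line in the log above, and `RETRY_LINE`, `SAMPLE`, and `summarize` are hypothetical names. The sample line does not include a hostname, so the sketch can only group by path.

```python
import re
from collections import Counter

# Assumed log shape, taken from the one sample line quoted in the chat.
# Real log lines from the tool may differ.
RETRY_LINE = re.compile(
    r"^(?P<ts>\S+) urllib3\.connectionpool WARNING: "
    r"Retrying \(Retry\(total=(?P<left>\d+),"
    r".*\[Errno (?P<errno>\d+)\] (?P<reason>[^']+)'\)': (?P<path>\S+)$"
)

# The sample line from the chat, copied as-is.
SAMPLE = (
    "2025-01-21T01:30:34Z urllib3.connectionpool WARNING: Retrying "
    "(Retry(total=4, connect=None, read=None, redirect=None, status=None)) "
    "after connection broken by 'NewConnectionError(': Failed to establish "
    "a new connection: [Errno 101] Network is unreachable')': "
    "/r/a/groups/Trusted-Contributors/members?recursive"
)


def summarize(lines):
    """Count retry warnings by (errno, reason) and by request path."""
    by_error = Counter()
    by_path = Counter()
    for line in lines:
        match = RETRY_LINE.match(line.strip())
        if match is None:
            continue  # not a retry warning in the assumed format
        by_error[(int(match["errno"]), match["reason"])] += 1
        # Drop the query string so variants of one endpoint group together.
        by_path[match["path"].split("?")[0]] += 1
    return by_error, by_path


if __name__ == "__main__":
    errors, paths = summarize([SAMPLE, "some unrelated log line"])
    print(errors.most_common())
    print(paths.most_common())
```

Grouping by errno separates "Network is unreachable" (101) from other connection failures, which is the distinction raised at 18:51:35. It cannot by itself tell a local network problem apart from an unreachable remote service.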