[08:27:37] JFTR, I've decommed lists1003, so that should reduce noise over the quiet period (confirmed with Lukasz, "Collaboration Services" will pick up the bookworm Mailman install on the lists1004 hardware that was bought)
[08:40:33] <3
[08:43:03] <3 that helps!
[09:01:32] moritzm: the motd on cumin1001 is nice, but it's buried in the rest of the motd... I guess we either put something larger, like the one on deploy1002, or it will get missed
[09:01:58] optionally colored too :D
[09:42:38] nah, should be fine, there was also a mail to sre-at-large as well
[09:48:46] No no, add "\033[31;5mPlease use cumin1002/cumin2002 for all cookbooks and Cumin\!\033[0m"
[09:49:35] as you want, but I can guarantee no one will read that small uncolored line ;)
[09:50:27] anyway, we can easily check its usage for cookbooks via https://sal.toolforge.org/production?p=0&q=%40cumin1001&d=
[09:51:02] and for cumin from: sudo tail -n1 /var/log/cumin/cumin.log
[11:46:29] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Clement_Goubert) >>! In T271142#9413333, @akosiaris wrote: >>>! In T271142#9382040, @Volans wrote: >> Another...
[12:30:00] volans, slyngs: when the two patches for moving the debmonitor client to the new source package are complete, let's build and deploy 0.3.2-2 and roll it out? Then we have a clean separation between the new client package and the future server package
[12:30:31] Sounds good
[13:00:52] 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Netbox: get rid of WMF Production Patches - https://phabricator.wikimedia.org/T310717 (10ayounsi) Submitted https://github.com/netbox-community/netbox/issues/14554 and https://github.com/netbox-community/netbox/issues/14555
[13:03:46] for the ganeti hosts listed in https://phabricator.wikimedia.org/T253173#9412420, should I create a dedicated task or are people OK with working from that one?
[13:23:42] volans, moritzm: about the motd, I opened that task not long ago as it's getting out of hand... https://phabricator.wikimedia.org/T352957
[13:37:56] Homer upgraded, tested on *ulsfo* without any issue; hopefully the Homer diff emails will be fully quiet during the break
[13:39:42] who knows, but fingers crossed!
[13:40:54] 10homer, 10netbox, 10Infrastructure-Foundations: Homer daily diff failing in codfw - https://phabricator.wikimedia.org/T329823 (10ayounsi) 05Open→03Resolved a:03ayounsi Homer 0.6.5 deployed, I'll re-open if the workaround doesn't workaround.
[14:05:25] 10netbox, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10Volans) Thanks for the patch, it will take a bit to do a full pass given the size. I agree...
[14:57:53] 10netbox, 10Infrastructure-Foundations, 10IPv6, 10User-jbond: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173 (10MoritzMuehlenhoff) >>! In T253173#9412420, @ayounsi wrote: > Those hosts only have a SLAAC v6 IP: `ganeti[2010,2016,2020].codfw.wmne...
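(Editor's note on the colored-banner idea from the 09:48 message above: the escape sequence quoted there is ANSI red + blink. A minimal sketch of how such a notice could be emitted from an update-motd.d snippet is below; the filename and path are illustrative assumptions, not the actual Puppet-managed motd setup on cumin1001.)

```bash
#!/bin/sh
# Hypothetical /etc/update-motd.d/99-cumin-deprecation (illustrative only;
# the real motd on the cumin hosts is managed via Puppet).
# \033[31;5m = red + blinking text, \033[0m = reset attributes.
printf '\033[31;5mPlease use cumin1002/cumin2002 for all cookbooks and Cumin!\033[0m\n'
```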
[15:26:29] 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10Marostegui)
[15:38:56] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10ayounsi)
[15:57:38] jhathaway: on puppetmaster2001 there are a few failed sessions (session-c107960.scope) for the gitpuppet user. Should those be cleared out? Is that a leftover from a previous issue or an ongoing one?
[15:57:50] the icinga alert says it's been like that for 46 days...
[15:58:30] yeah, I was not aware, but happy to take a look
[15:59:59] this should be https://phabricator.wikimedia.org/T199911
[16:00:16] there's a toil::system_scope_cleanup class which deploys a workaround
[16:00:44] we only enable it on swift and IIRC some backup hosts (where this shows up more often)
[16:01:20] I recall that workaround, but I've never seen it needed on the puppetmasters
[16:01:21] so let's maybe clear out the current ones manually, and if there is a more persistent pattern we can add the class to the puppet servers as well
[16:01:29] so I was wondering if there was a different problem here
[16:01:42] I think it's the same, but we probably already solved it:
[16:02:03] before Jesse enabled the caching mode on the puppet servers we had the situation that they were under high load
[16:02:13] which also aligns with the 46 days
[16:02:31] ok, then a manual clear and waiting to see if it re-occurs seems the way to go
[16:02:46] so I'd say, let's clear these out manually for now; likely this won't happen again with the current setup
[16:02:56] agreed
[16:02:57] ok, I'll clear them out
[16:03:10] thx
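(Editor's note on the manual cleanup agreed above: failed transient session scopes can be cleared with systemctl reset-failed. A minimal sketch follows; the unit name is the one quoted in the log, and this is not necessarily how the toil::system_scope_cleanup class does it.)

```bash
# List session scopes currently in the failed state on the host.
systemctl list-units --type=scope --state=failed 'session-*.scope'

# Clear the failed state for the specific scope mentioned above,
# so the systemd-unit/Icinga check recovers.
sudo systemctl reset-failed 'session-c107960.scope'

# Or clear every failed session scope in one go.
sudo systemctl reset-failed 'session-*.scope'
```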