[02:34:00] serviceops, Infrastructure-Foundations, WikimediaDebug, Performance-Team (Radar): Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 (Krinkle) @MoritzMuehlenhoff I believe this is something your team usually do, but not 100% sure. Feel free to re-route as...
[08:30:02] serviceops, Infrastructure-Foundations, WikimediaDebug, Performance-Team (Radar): Upgrade php-excimer package from 1.0.4 to 1.1.1 - https://phabricator.wikimedia.org/T332964 (MoritzMuehlenhoff) >>! In T332964#8723594, @Krinkle wrote: > @MoritzMuehlenhoff I believe this is something your team usua...
[09:40:55] serviceops, Dumps-Generation, SRE, MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (Joe)
[09:45:32] hi folks
[09:45:55] so the hosts kafka-main[12]00[1-3] are very old and probably in need of a refresh next year
[09:47:02] they seem to have dhcp problems in d-i after PXE, and in order to make it work we'd need to update idrac+nic+bios, but since they are old it may or may not work (I've done the upgrades to kafka-main[12]00[45] and it worked nicely, but those are more recent hosts)
[09:47:17] Moritz suggested to simply dist-upgrade to bullseye, would it be ok for you?
[10:23:55] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) From a conversation with Moritz on IRC about how to dist-upgrade safely: ` 10:09 - disable Puppet 10:09 - update /etc/apt/sources.list to use s/buster/bullseye 10:09 - apt-get dis...
[10:24:04] added the procedure in --^
[10:53:16] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (eoghan) Open→Resolved {F36924942} We have an alert to catch the condition where a pod gets scheduled on a non-dedicated ho...
[10:53:53] <_joe_> elukey: say it fails; what is our plan?
[10:57:50] _joe_ it should be unlikely since we already have 4 nodes running bullseye, but I think that we will try the idrac+nic+bios upgrade for sure
[10:58:05] it should work, but not all nodes have been tested etc..
[10:58:23] it proved that in newer ones it solved the reimage issue (dhcp in d-i not working)
[10:59:21] and in the worst possible case we'd fall back to do what we'd do if a node breaks completely
[10:59:41] (so either repurpose another one and assign the broker id etc.., or move the partitions to other nodes)
[10:59:46] does it make sense?
[11:03:38] <_joe_> uhm
[11:03:51] <_joe_> yes :)
[11:08:31] ack will do the first one next week then
[14:45:16] If someone could take a gander at this change to our apache2 puppet module that would be appreciated, https://gerrit.wikimedia.org/r/c/operations/puppet/+/902501, I would also like feedback on moritz's suggestion on rolling it out if possible
[14:55:15] jhathaway: o/ my2c - it is more painful but I'd try to have a flag to enable/disable the new feature, and control it via hiera settings in roles. It should be fine but I'd be way more comfortable if we could roll it out to say only canaries (that have a separate role etc..) and let it bake for some days
[14:55:53] once we are confident on the change then we can enable it by default and that's it
[15:06:20] sounds good, I'll take that route elukey
[16:24:00] serviceops, Platform Team Workboards (Platform Engineering Reliability): Replace Nutcracker - https://phabricator.wikimedia.org/T333019 (hnowlan)
[16:48:10] elukey: if you are still around do you think you could take a gander at the revised patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/902501
[16:53:42] I'm not sure how I feel about that role check
[16:54:39] It should be a variable in hieradata/role/common/mediawiki/appserver/canary_api.yaml and hieradata/role/common/mediawiki/canary_appserver.yaml set to true, and default to false, then we can turn it on progressively to other roles
[16:54:57] But it also feels a bit overkill to do the whole lookup traversal
[16:56:25] claime: yeah, given this code is hopefully short lived, i.e. I think this is safe to go everywhere, I thought an inline check was fine, but I can be persuaded otherwise
[16:57:35] jhathaway: I don't have a strong feeling about it either, provided it does go away pretty quickly :D
[16:59:27] I'm pretty sure any box running Debian Jessie or newer should work with this code, but maybe there is a Wheezy box lurking around somewhere
[16:59:59] * claime 😨
[17:25:02] anyone feel comfortable giving me a +1 on the patch?
[19:11:38] serviceops, Platform Team Workboards (Platform Engineering Reliability): Replace Nutcracker - https://phabricator.wikimedia.org/T333019 (jijiki) Thumbor is using nutcracker for memcached sharding, thus we can happily use mrouter there :)
[21:21:34] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Patch-For-Review: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (Dzahn) excluded by releng: - https://integration.wikimedia.org - https://releases-jenkins.wikimedia.org
[21:53:10] you can run "lsb_release -c" on * via cumin to get a nice summary of what distros exist and how many. there is still cloud though.
[21:53:46] that was re: a couple hours ago because of the "wheezy" mention
[21:54:18] just look at puppetboard's facts
[21:54:39] there is also https://os-reports.wikimedia.org/
[21:54:41] have also a pie graph of the distribution
[21:55:12] also we have a cumin alias for each distro that gives you the number
[21:55:44] and no jessie is the oldest
[21:55:52] sorry stretch
[21:56:00] no jessies around
[21:56:45] * volans|off fades away