[07:52:23] netbox, Infrastructure-Foundations: netbox: drop profile::netbox::active_server parameter - https://phabricator.wikimedia.org/T309034 (ayounsi) Middle/longer term the reports status should go through Prometheus so we could revisit at this point. Until then I agree with Riccardo. As it's the same databas...
[08:43:52] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (MoritzMuehlenhoff) p:Triage→Medium
[09:45:53] SRE-tools, Spicerack: sre.ganeti.makevm NXDOMAIN race condition - https://phabricator.wikimedia.org/T309505 (jbond) p:Triage→Medium
[09:53:24] topranks: there are homer diffs for 4 devices, are those expected? (asking because 2 are asw and people might run into those diffs while deploying usual trivial changes)
[10:09:58] volans: shit my bad thanks. wasn't aware the labs filter was also used in codfw, updating now.
[10:10:21] The ASW changes look like some dc-ops work, not related to anything I was doing.
[10:34:13] netops, DC-Ops, Infrastructure-Foundations, SRE, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ayounsi) Please don't forget to run Homer after re-naming as the switch port description contains the hostname. The current outstan...
[10:51:41] SRE-tools, Icinga, Infrastructure-Foundations, SRE, observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (Volans)
[10:53:07] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin2002 for hosts: `netbox2002.codfw.wmnet` - netbox2002.codfw.wmnet (**WARN**) - //Host not found on Ici...
[10:53:34] SRE-tools, Infrastructure-Foundations, Spicerack: sre.ganeti.makevm NXDOMAIN race condition - https://phabricator.wikimedia.org/T309505 (Volans) Sure, let's call that cookbook from the makevm at the right time.
[12:40:48] netops, Infrastructure-Foundations: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (jbond) p:Triage→Medium
[13:29:29] the reboot-single cookbook now emits an error which I haven't seen before, didn't dig deeper yet, but maybe some underlying library code got updated which broke it?
[13:29:59] Caused by: Unable to extract data with for idp-test1002.wikimedia.org from: System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).
[13:30:00] 446261.89 865131.15
[13:30:02] Caused by: could not convert string to float: 'System'
[13:30:19] rest of the output looks like before
[13:32:55] although it might also be some kind of rare race condition, I didn't get that message with a second reboot
[13:52:39] Could be some message that has changed, given that it tries to parse a string as a float
[13:53:03] moritzm: it's trying to convert the output to a float
[13:53:06] including stderr
[13:53:19] it was by choice, to detect those kinds of issues tbh
[13:53:31] but if that stderr output is "expected" we can fix the parsing side of it
[13:58:40] the message by pam_nologin is kind of expected given the underlying Debian setup. but it's strange that it's the first time we're hitting this
[13:58:52] given I've used that cookbook hundreds of times
[14:05:56] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (ArielGlenn) a:ArielGlenn→None Not sure who should get this next but it's not Hannah or I :-) I was never involved in the configuratio...
[14:45:12] netops, Infrastructure-Foundations, SRE: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (Volans) IIRC that hostname is evaluated by the DHCP at restart time and then the resulting IP is used in the configuration. Because that's a valid hostname in our DNS it would h...
[14:48:16] moritzm: yeah that's what's weird. stderr has never been discarded
[14:48:35] so if that message appeared randomly we should have caught it earlier
[14:48:58] that's what makes me think something is not totally right in this specific host to cause it
[14:58:32] it's one of the new IDP bullseye VMs, not 100% done yet with the setup, but otherwise pretty normal
[14:58:44] I'll check if I can repro
[14:59:46] ack thx
[15:00:19] try also with cumin to see if there is a difference
[15:08:24] volans: that looks like an error coming from wait_reboot_since. it is trying to parse stderr as the uptime. see the first word of the error is "System...[is going down. Unprivileged users]"
[15:08:43] this is the same error i got with the SendEnv LANG issue resolved recently
[15:09:02] and it also looks like the following patch was intended to fix a similar issue https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/747155/5/spicerack/remote.py#562
[15:09:14] jbond: yes that's what I said above
[15:09:24] oh sorry missed that
[15:09:24] is trying to convert the output, including stderr, to a float
[15:09:33] but that was kinda intended, to spot those issues
[15:09:52] if we ignore blindly stderr
[15:10:04] it could hide various things
[15:10:24] ack
[15:10:54] but whatever is the consensus, we can adapt to that :)
[17:42:42] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (MoritzMuehlenhoff) p:Triage→Medium
[17:43:59] Mail, Infrastructure-Foundations, SRE, Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (MoritzMuehlenhoff) p:Triage→Medium
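Note on the uptime parsing discussion above (13:29 to 15:10): one possible middle ground between blindly ignoring stderr and failing on any extra output is to skip only an explicit allowlist of known-harmless lines and still raise on anything else. The following is a minimal sketch of that idea, not the Spicerack implementation. The function name `parse_uptime`, the `EXPECTED_NOISE` allowlist and its patterns are all hypothetical, and it assumes the command output has the `/proc/uptime` shape seen in the log (two numbers on one line).

```python
import re

# Hypothetical allowlist of lines considered harmless while a host is
# shutting down. These patterns are assumptions based on the message
# quoted in the log, not an authoritative list.
EXPECTED_NOISE = (
    re.compile(r"^System is going down\."),
    re.compile(r"^Unprivileged users are not permitted to log in"),
    re.compile(r"^For technical details, see pam_nologin\(8\)\."),
)

# A /proc/uptime-style line: "<uptime seconds> <idle seconds>".
_UPTIME_LINE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)$")


def parse_uptime(output: str) -> float:
    """Return the uptime in seconds from combined stdout/stderr output.

    Lines matching EXPECTED_NOISE are skipped. Any other unexpected line
    raises ValueError, so new kinds of stderr output are still noticed.
    """
    uptime = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = _UPTIME_LINE.match(line)
        if match:
            if uptime is not None:
                raise ValueError("more than one uptime line in output")
            uptime = float(match.group(1))
            continue
        if any(pattern.search(line) for pattern in EXPECTED_NOISE):
            continue
        raise ValueError(f"unexpected line in uptime output: {line!r}")
    if uptime is None:
        raise ValueError("no uptime line found in output")
    return uptime
```

With this approach the pam_nologin message from the log would be tolerated, while any other stray output would still fail loudly, which keeps the original intent of spotting unexpected stderr.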