[07:42:42] Mail, Infrastructure-Foundations, SRE: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10443606 (Aklapper)
[08:22:58] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10443703 (MatthewVernon)
[09:20:28] SRE-tools, Infrastructure-Foundations, SRE, Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10443781 (Volans) @elukey good question. Surely we're not working on this but we still have python2 code around, not too much but there is. I'm...
[10:23:25] FIRING: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:25] RESOLVED: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:14] cdanis: for when you are around, is there a good way to filter NELs for a given ASN?
[14:10:28] topranks: yep that's on the dashboard even :)
[14:10:34] so for instance if I want to look at NELs from AS9428 ??
[14:10:35] although it combines ISP and ASN into one field, they are present separately
[14:10:48] yeah, is there any way to know what the string for the ISP name is going to be?
[14:10:55] they are also separate fields in the messages
[14:11:17] I looked at one of the messages all I could see was the isp_asn field (combined), let me look again
[14:11:36] it's buried :)
[14:11:49] topranks: https://logstash.wikimedia.org/goto/fbcf47601ff5f608af4ec77707fb84eb
[14:11:50] http.request_headers.x-geoip-as-number
[14:11:55] yep :)
[14:12:00] I'd glanced over the http ones thinking they'd not be it
[14:12:03] cool, thanks!
[14:12:11] yeah, it actually comes from Varnish
[14:12:24] gets set as a backend request header as it constructs the req to send to EventGate
[14:15:17] topranks: I see zero NELs from them ever... strange
[14:16:22] yeah that is odd
[14:16:57] though they are the upstream of the network I was actually interested in
[14:17:02] ahh
[14:17:12] Bharti are obviously a large provider but perhaps that ASN is just their core or something idk
[14:17:27] did you see the email from iWiSP btw?
[14:19:24] I'm just seeing it now
[14:19:27] * topranks reading
[14:25:34] I need to go into a meeting I'll respond to him later thanks for the heads up must have missed it over break
[15:21:17] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10444844 (dcaro) Our current version of ceph does not support the `mon_use_min_delay_socket=true` option :/, so only for osds then. To set...
[16:00:36] topranks: o/ not urgent but I am working on wikikube-worker2022, it doesn't DHCP and I noticed something strange.
The move-vlan cookbook tells me that nothing needs to be done, but in https://netbox.wikimedia.org/dcim/devices/2491/interfaces/ I see a /22 IP address, and in other hosts in lsw1-c6-codfw I see a /24 (for example https://netbox.wikimedia.org/dcim/devices/2494/interfaces/)
[16:01:06] I am very ignorant about the new format for the VLANs, this is why I am asking
[16:02:43] I'll take a closer look after the meeting
[16:03:08] it's fine to be on the /22 (row-wide vlan) in that rack, but ideally if re-imaging we should move to the new / rack-specific vlan
[16:36:16] yep yep, what I am wondering is if the new vlan format uses /24s or /22s, because if I run move-vlan on that host I get that it is already good
[16:37:28] full context: Je*lto was reimaging + move-vlan it, a netbox timeout caused a failure in the cookbook. While investigating an issue with PXE + DHCP (our lovely duo), I checked the host'IP and got confused
[16:37:32] info in https://phabricator.wikimedia.org/T383228
[16:37:44] in the cookbook we have
[16:37:44] if self.pre_config['vlan'] not in LEGACY_VLANS:
[16:37:45] logger.info('Server not in a vlan requiring a migration, nothing to do. 👍')
[16:38:04] the host is indeed not in a legacy vlan, but probably we should also check the IP?
[16:39:12] * topranks looking
[16:40:11] elukey: yes this is an odd scenario, I'm unsure how things ended up this way
[16:40:19] I wonder did the netbox failure mean it only got half completed?
[16:41:04] topranks: I guess so yes, I didn't check in details the cookbook logs but I'll do it. What do you think it is the best course of action? Fixing manually?
[16:41:12] the host has IPs from private1-c-codfw (legacy), but in Netbox the switch port has been configured for private1-c6-codfw (new)
[16:41:28] exactly this was the bit that I didn't get
[16:42:48] I'm guessing maybe the timeout was the issue?
[16:42:58] The switch vlan was modified this morning through automation
[16:42:58] https://netbox.wikimedia.org/extras/changelog/205343/
[16:43:13] what I am wondering is - if we add a check for the IP subnet in the cookbook, basically forcing this use case to not be considered "ok", could a complete re-run fix everything?
[16:43:22] I guess if there was an issue that caused the cookbook to fail after that, but before it re-assigned the IPs, this could happen?
[16:44:56] elukey: yes we could add a check for this, though I'm not sure it's needed. we have a nightly report that will catch it
[16:44:58] https://usercontent.irccloud-cdn.com/file/NRLal3wR/image.png
[16:44:58] from the stacktrace posted in the task it seems that it broke at self.update_netbox_ip_vlan(self.post_config)
[16:45:20] I think the simplest way to progress might be to manually set the vlan back to what it was on the switch
[16:45:23] that's like 1 click
[16:45:28] ahhhhh right
[16:45:33] okok makes sense
[16:45:34] then if you re-run it should detect it needs to change the vlan and do everything from the top
[16:45:38] let me do that now
[16:45:53] is it just edinging the interface's data?
[16:46:29] edinging? editing?
[16:46:36] hahahah yes sorry
[16:46:37] yeah it's just on this one
[16:46:37] https://netbox.wikimedia.org/dcim/interfaces/36820/
[16:46:40] I just did it
[16:47:00] 'edit' button in the top right, then scroll down and change the vlan back to private1-c-codfw
[16:47:28] ok so I can now run move-vlan now IIUC
[16:48:57] yep should hopefully work right now
[16:49:05] sudo cookbook sre.hosts.move-vlan inplace wikikube-worker2022
[16:49:18] (it was previously called by reimage but this is faster)
[16:49:47] ah no, not implemented, I think I need reimage
[16:49:58] running it
[16:51:04] it seems working :)
[16:54:26] finished!
[16:56:11] why we ended up in this weird state?
[16:56:22] that required manual editing
[16:59:28] volans: there was a netbox timeout when calling it IIUC
[16:59:43] it was called by reimage though (move-vlan)
[17:02:17] and did it retry at all? or all retries failed
[17:02:27] and why did it continue if it timed out?
[17:02:34] worth to add something in the rollback()?
[17:04:13] from the stacktrace it seems that it retried yes
[17:04:43] we could add some checks in move-vlan, so if it rehappens we can signal it to the user
[17:04:59] not sure how easy is to properly rollback, maybe we could check it
[17:05:19] (the host is in d-i now, so everything seems working, thanks topranks !)
[17:05:27] great that it works
[17:05:52] yep glad that worked!
[17:06:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:11:03] jelto: wikikube-worker2022 is finishing reimage, it should be ready by tomorrow (I'll check the final execution status later on)
[17:15:16] elukey: oh great thank you! Thanks for figuring out what was needed in netbox. I'll check the node tomorrow and pool it again if it looks good
[17:15:43] jelto: np! Most of the credits to topranks though!
[17:18:40] If needed I can paste the cookbook output somewhere, it should be on the cumin hosts still.
[17:19:13] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10445361 (cmooney) This seems to be working ok following the merge. Packets are being properly matched in the iptables rules and the DSCP m...
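A minimal sketch of the extra check discussed above for move-vlan: besides testing the vlan name against LEGACY_VLANS, compare the network of the host's IP with the prefix of the vlan configured on the switch port. The function names, the example addresses and the contents of LEGACY_VLANS here are assumptions for illustration only, not the cookbook's actual code or real Wikimedia prefixes.

```python
import ipaddress

# Assumed for illustration: the real cookbook defines its own LEGACY_VLANS
# and reads the host IP and vlan prefix from Netbox.
LEGACY_VLANS = {"private1-c-codfw"}


def vlan_ip_mismatch(host_ip: str, vlan_prefix: str) -> bool:
    """Return True if the host's IP network differs from the vlan's prefix.

    host_ip is given with its mask (e.g. "10.0.33.17/22"), so a host still
    carrying a row-wide /22 address on a rack-specific /24 vlan is flagged.
    """
    host_network = ipaddress.ip_interface(host_ip).network
    return host_network != ipaddress.ip_network(vlan_prefix)


def needs_migration(vlan_name: str, host_ip: str, vlan_prefix: str) -> bool:
    """Hypothetical combined check: legacy vlan name OR IP/vlan mismatch."""
    return vlan_name in LEGACY_VLANS or vlan_ip_mismatch(host_ip, vlan_prefix)


# Made-up example of the half-completed state: the switch port is already on
# the new vlan, but the host still has an address from the old /22.
print(needs_migration("private1-c6-codfw", "10.0.33.17/22", "10.0.40.0/24"))
```

With a check along these lines the half-migrated host would be reported as needing work instead of "nothing to do", at the cost of one more Netbox lookup for the vlan prefix.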
[17:26:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:48:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:58:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[20:12:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:17:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:22:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[22:10:51] CAS-SSO, Data-Engineering, Data-Engineering-Icebox, Data-Engineering-Jupyter, and 2 others: Improve the JupyterHub services and use CAS/SSO - https://phabricator.wikimedia.org/T260386#10446233 (Ottomata)