[00:05:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:41:39] I'm back !
[07:47:17] elukey, volans|off, opening the pandora box, but shouldn't we re-consider using MAC addresses for DHCP? :)
[07:49:10] for https://phabricator.wikimedia.org/T365372, and to get rid of the problematic option 82, and to not require all the extra work for option 97 (which we don't know if it's supported on supermicro too https://phabricator.wikimedia.org/T322578#8610824)
[07:49:45] and to have one single mechanism for it all
[08:05:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:13] XioNoX: how would you get the MACs?
[08:06:23] and welcome back!
[08:06:38] volans: barcode on the device?
[08:06:54] so manual and painful?
[08:07:40] volans: depends :)
[08:11:26] volans: I think it's worth discussing the options, especially seeing how painful option 82 is, and the possible lack of alternatives for supermicro
[08:12:10] it's standard practice in lots of places to just scan the server's QRcode to automatically enter all its details (serial, etc)
[08:12:27] you're mixing prod and mgmt dhcp issues though
[08:13:26] volans: well, if we can have a single process for both sides it would be even better
[08:14:03] seeing DHCP as a whole and not just per type of servers, otherwise there is a risk of even more fragmentation
[08:36:27] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897100 (ayounsi) It's necessary to do the diff on all target devices anyway, so that behavior is fine. For example, if we run `homer "*ulsfo*" commit "foo"` to change a SSH k...
[08:46:37] If I have a host that does not respond on its main interface, the management interface, and ipmi fails with "Error: Unable to establish IPMI v2 / RMCP+ session", is there something more I can try, or is it straight to dc-ops?
[08:46:47] mw2321.codfw.wmnet is the host
[08:47:19] claime: does remote IPMI fail quickly or slowly?
[08:47:41] password is the management password right?
[08:47:50] yes
[08:48:03] takes a few seconds, so I'd say slowly
[08:48:24] around 20 to 30s
[08:48:43] ok, was to exclude pwd out of sync ;) https://wikitech.wikimedia.org/wiki/Management_Interfaces#If_it_fails_very_quickly:
[08:49:12] redfish fails too FWIW
[08:49:39] claime: I can't ping the mgmt IP, did it change?
[08:49:56] shouldn't have
[08:50:02] it's not a recent reimage or anything
[08:50:06] then mgmt is dead
[08:50:10] great
[08:50:23] at least remotely
[08:50:54] so yeah I'd say you need physical access on this one
[08:51:08] ok, opening a task then
[08:51:10] thanks
[08:54:53] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9897254 (ayounsi) Can we move the cables instead of moving the servers ? For example Port 44 to 47 can be used right away at...
[08:55:27] power cycle the box :(
[08:56:09] yeah will probably need a cold reboot
[09:01:16] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897371 (Volans) I like the last proposal but I was thinking that there is an additional case: 1. apply to this device and ask for the next one unless already cached and appro...
[09:06:40] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897418 (ayounsi) Yeah I think it's what I tried to mean with > We can also decide that batch means to silently skip any device that have a different diff, to not risk blockin...
[09:24:34] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897529 (cmooney) For my part I like “3” as set out by Volans above. @ayounsi is your proposal that “batch” would be a valid answer (in addition to yes/no) when presented wit...
[09:27:03] who is the puppet expert these days ? I'd love a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037784 :)
[09:31:24] CAS-SSO, Infrastructure-Foundations, SRE: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9897565 (MoritzMuehlenhoff) >>! In T367487#9891993, @SLyngshede-WMF wrote: > I've run a test build, Java 21 is a hard requirement, it cannot be older or newer. > Otherwise the overlay upgrade conta...
[09:51:49] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9897652 (cmooney) @ayounsi perhaps I was a little quick to conclude all the blocks were assigned, you are correct. The advan...
[11:05:54] netops, Infrastructure-Foundations, Traffic: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 (ayounsi) NEW
[11:13:11] netops, Infrastructure-Foundations, Traffic: POPs LVS : remove public vlan trunking - https://phabricator.wikimedia.org/T367732 (ayounsi) NEW
[12:05:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:06:52] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9898123 (MoritzMuehlenhoff)
[12:36:37] netops, Infrastructure-Foundations, Traffic: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731#9898286 (Vgutierrez) Don't be too aggressive with this one, we could need to rollback at some point, let's wait a few weeks at the very least
[12:49:50] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046659 seems to be migrating from the py3 version to the py2 version of the script... is that intended?
[12:54:56] XioNoX: (reading the backlog about DHCP) - maybe we could discuss this after the backlog review, or during team meeting (not sure yet when would be the best time)
[12:55:28] +1
[12:55:58] so far we also have the problem of the default mgmt bmc password that is not a standard one like DELLs, but it changes for every Supermicro (and it is printed on the server's label IIUC)
[12:57:21] to quickly unblock the next load of servers that we'll get we could even think about asking explicitly the mac-address in the provision cookbook, while we study a better solution. It is manual and horrible I know but I don't see other wayts
[12:57:25] *ways
[12:57:45] (or we store the mac address of the mgmt in netbox)
[12:58:47] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:06] XioNoX: I can check the puppet change later on, but I am not John, I probably know 1/10th of his knowledge :D
[13:05:09] volans: yes. the old py3 WIP one was never working properly and the bullseye rebuild will stick with Py2 given that we'll ditch/replace this service with something new anyway
[13:06:14] ah ok so it's basically a revert, because some of the changes are technically unnecessary and they will work on py2 too
[13:07:09] kinda, but it's not really a revert, given the py3 stuff was never actually live.
[13:07:44] revert git wise :)
[13:08:24] do we know what didn't work with py3? The script seems really straightforward, maybe we can check if we can change/fix it to stay with py3
[13:10:17] not in the script itself, but using the py3 or py-irc changes a lot of things in terms of string/bytes handling of the IRC messages being processed
[13:10:49] sigh
[13:11:53] volans: I need to fix the test_matching_vlan() function in the Netbox network report, it's failing because we have some IPs configured as /32s and the netmask not matching is causing the failure. The simplest way forward is probably to use the Python ipaddress module, any reason not to import it in the report?
[13:12:39] topranks: caused by the routed ganeti?
[13:13:00] no it's on gitlab2002
[13:13:23] no problem at all, also if you get an address from the netbox api often the .address property is already an IP address/interface object
[13:13:39] but sure feel free to import ipaddress at will, it's stdlib
[13:13:59] I think it just looks at physical ports so probably routed ganeti won't be an issue - but I'll check for that while I'm in there make sure we're future-proofed
[13:14:18] ok thanks, will double check if .address is already ipaddress.ip_address cool
[13:16:33] depends which object you have at hand but usually yes
[13:18:48] topranks: actually sorry they are netaddr objects
[13:18:57] 208.80.153.8/32
[13:18:58]
[13:19:04] ah ok
[13:19:09] most likely you can do what you need with them too
[13:19:26] I'll see how I get on anyway, if I can avoid importing it I will, otherwise I'll use ipaddress
[13:19:29] thanks
[13:19:59] netaddr predates when ipaddress was available in the stdlib but is also more powerful, so surely does what you need :D
[13:40:31] elukey: 1/10 of John is already plenty enough :)
[13:41:09] elukey: yeah, happy to discuss the DHCP stuff. Is it possible to have a picture of the barcode/label and all the information that it contains?
[13:45:38] I can ask Jenn if they can check, following up now
[13:46:09] we can make netbox accept image uploads and do OCR on the mgmt pw ;)
[13:46:36] * elukey cries in a corner
[13:46:37] cdanis: hopefully everything is encoded in a qrcode and we just need to scan it
[13:46:51] in a netbox script or similar
[13:49:13] netops, Infrastructure-Foundations, Traffic: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731#9898761 (ayounsi) Of course ! not planning on doing it today :) The task is there to not forget.
[13:50:43] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142#9898774 (akosiaris)
[13:53:26] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142#9898785 (akosiaris) Open→Resolved a:akosiaris I've removed * dumpsdata[1001-1003].eqiad.wmn...
[13:53:46] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:15] netbox, Infrastructure-Foundations, IPv6: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173#9898789 (akosiaris)
[14:28:55] Puppet, Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9898956 (CDanis) Alternatives to consider: * Make this a required field instead of adding a default [harder up-front but potentially safer] * Make omitting this field wmf pup...
[14:31:32] SRE-tools, Infrastructure-Foundations, Observability-Alerting, Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9898979 (CDanis) Suggestions from discussion at I/F meeting: * It's probably not necess...
[14:57:57] Puppet, Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9899155 (MoritzMuehlenhoff) One other option: Add a separate wrapper define systemd::timer::job_capped which has the timeout as a mandatory argument (but without a default)....
[15:04:53] netops, Infrastructure-Foundations, Traffic: POPs LVS : remove public vlan trunking - https://phabricator.wikimedia.org/T367732#9899192 (ayounsi) p:Triage→Low
[15:05:14] netops, Infrastructure-Foundations, Traffic: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731#9899194 (ayounsi) p:Triage→Low
[15:05:35] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9899198 (Gehel)
[15:05:51] netops, Infrastructure-Foundations: Capirca setup for routed Ganeti VMs - https://phabricator.wikimedia.org/T367265#9899196 (ayounsi) p:Triage→Medium a:ayounsi
[15:05:58] Puppet, Cloud-VPS: systemd-timer-mail-wrapper should not send mail as root@wikimedia.org from Cloud VPS - https://phabricator.wikimedia.org/T367028#9899208 (joanna_borun)
[15:06:18] SRE-tools, Infrastructure-Foundations, Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9899221 (elukey) p:Triage→Medium
[15:08:50] Puppet, Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9899269 (joanna_borun) p:Triage→Medium a:CDanis
[15:09:58] SRE-tools, Infrastructure-Foundations, Observability-Alerting, Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9899294 (CDanis) p:Triage→Medium
[15:10:04] Puppet, Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9899295 (CDanis) p:Triage→Low
[15:10:19] Mail, Infrastructure-Foundations, SRE: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9899300 (jhathaway) p:Triage→Medium
[15:10:29] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9899302 (joanna_borun) p:Triage→Low
[15:13:30] CAS-SSO, Infrastructure-Foundations, Phabricator, Wikimedia-Phabricator-Extensions: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account - https://phabricator.wikimedia.org/T366766#9899313 (MoritzMuehlenhoff) >>! In T366766#9895937, @Aklapper...
[15:17:17] Mail, fundraising-tech-ops, Infrastructure-Foundations, SRE: Update fundraising mail / firewall settings to use new production mx-in hosts - https://phabricator.wikimedia.org/T367573#9899341 (cmooney) p:Triage→Medium
[15:17:20] CAS-SSO, Infrastructure-Foundations, Phabricator, Wikimedia-Phabricator-Extensions: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account - https://phabricator.wikimedia.org/T366766#9899350 (joanna_borun) p:Triage→Low
[15:19:23] Jenn already sent me the picture of the supermicro labels - one contains password and mac address, the other doesn't contain anything useful (the QR code present leads to a website with generic info about the model)
[15:19:37] netops, Infrastructure-Foundations, Traffic: POPs LVS : remove public vlan trunking - https://phabricator.wikimedia.org/T367732#9899368 (cmooney) > Slight drawback here is that in ulsfo/eqsin LVS traffic towards public hosts (if any) will hair-pin through the routers and back to the switches. I thin...
[15:20:21] elukey: noted, where is the serial# ?
[15:20:51] forwarded the email to you
[15:20:52] elukey: mac of the mgmt or the host?
[15:21:03] if it matches the pdf it was mac of the mgmt
[15:21:28] yeah it is with the BMC password, we need to check but I hope it is the mgmt mac :D
[15:21:46] elukey: no serial# on the pics :)
[15:22:19] XioNoX: I thought it was the P/N value in the supermicro label
[15:22:33] That's the part number
[15:23:08] then I have no idea, so there is another label with the serial usually?
[15:24:37] there should be, usually on the chassis but check the PDF
[15:24:56] see https://www.supermicro.com/en/support/rma/sn
[15:25:45] elukey: dunno, dell puts them all together, not sure for supermicro, but it should be somewhere
[15:26:10] what I am wondering is - do we want to know if there are other info where the serial is? Because in theory in this case it shouldn't be needed (we already have it in netbox)
[15:27:08] elukey: if they put it in netbox it's by copying it from somewhere, no?
[15:27:16] maybe it's on the box itself
[15:27:33] it would be useful to document all the info that comes with the server and where the info is
[15:27:43] okok I wanted to know this :)
[15:28:10] It is also possible that some info come in other forms (like a spreadsheet) and DCops only checks
[15:29:30] yeah maybe, we're discovering a new platform, so it changes the way we do things
[15:30:29] could be useful to have it all documented in a new subpage of https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation
[15:32:11] +1
[15:59:37] CAS-SSO, Infrastructure-Foundations, Phabricator, Wikimedia-Phabricator-Extensions, Patch-For-Review: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account - https://phabricator.wikimedia.org/T366766#9899571 (Aklapper) a:Aklapper
[16:24:03] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9899763 (cmooney)
[16:25:01] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9899765 (cmooney) >>! In T364092#9766653, @ayounsi wrote: > Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru. Doh, I went with 22.2R...
[16:56:39] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#9899880 (cmooney)
[17:31:54] Mail, fundraising-tech-ops, Infrastructure-Foundations, SRE: Update fundraising mail / firewall settings to use new production mx-in hosts - https://phabricator.wikimedia.org/T367573#9900161 (Dwisehaupt) PFW update tracked in T367796.
[17:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
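Note on the 13:11:53 to 13:19:59 thread about test_matching_vlan(): the report's actual code is not shown in this log, so the following is only a minimal sketch of the idea discussed there, using the stdlib ipaddress module. It checks that an interface address falls inside a VLAN's prefix without comparing netmasks, so an address configured as a /32 still matches. The function name address_in_vlan_prefix and the 208.80.153.0/27 prefix are hypothetical and chosen for illustration; only the 208.80.153.8/32 address appears in the conversation.

```python
import ipaddress


def address_in_vlan_prefix(address: str, vlan_prefix: str) -> bool:
    """Return True if the host part of `address` lies inside `vlan_prefix`.

    Hypothetical helper: the netmask on `address` is deliberately ignored,
    so "a.b.c.d/32" matches a wider VLAN prefix that contains a.b.c.d.
    """
    iface = ipaddress.ip_interface(address)    # e.g. 208.80.153.8/32
    network = ipaddress.ip_network(vlan_prefix)
    if iface.version != network.version:
        # Never match an IPv4 address against an IPv6 prefix or vice versa.
        return False
    # Containment check on the bare IP, not on the interface's own network.
    return iface.ip in network


if __name__ == "__main__":
    # The /32 from the chat against an assumed VLAN prefix.
    print(address_in_vlan_prefix("208.80.153.8/32", "208.80.153.0/27"))
    print(address_in_vlan_prefix("10.0.0.1/24", "208.80.153.0/27"))
```

As mentioned in the thread, Netbox hands back netaddr objects, and netaddr supports a similar containment check, so importing ipaddress may turn out to be unnecessary.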