[01:40:04] eoghan: yeah correct, CNAME means no other records for the same name [02:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:29:34] eoghan: can lists.wm.o be a A/AAAA instead of a CNAME alongside the lists1004 ? [06:30:23] basically both pointing to the same IP ? [06:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:38:36] PCC looks broken for cumin nodes since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035474 [07:38:45] example https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047107 [08:01:47] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9906042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1002 for host netbox-dev2003.codfw.wmnet with OS bookworm [08:26:32] XioNoX: I tried that but the DNS checker gave out because there was no PTR for that. [08:26:34] I see no CNAMEs here? And all looks good in terms of PTRs too from what I can tell? [08:27:09] eoghan: could it have two PTRs? or that would be problematic? [08:27:58] https://phabricator.wikimedia.org/P65193 [08:28:23] it does have two PTRs, which to my knowledge is no issue for MTAs etc, but I'm out of that business a _long_ time :) [08:28:28] XioNoX: That's the current state, and we've found that in at least one instance we've had to add both names (lists1004.wm.o and lists.wm.o) to the allow rules for things because sometimes it resolves to lists.wm.o and sometimes to lists1004.wm.o, it's not deterministic. 
[08:28:39] My concern is that it might be seen as somewhat spammy. [08:28:53] the two PTRs is an issue? [08:29:00] I don't really know tbh. [08:29:13] I'm just paranoid about it ending up in a blacklist somewhere [08:29:32] agreed, but let's not stumble in the dark here [08:29:45] the setup looks ok based on my archaic knowledge, but I'd defer to the mail experts [08:30:20] if the two PTRs for the same IP is an issue we can probably remove the Netbox-generated DNS entries for it and do it all manually in the zone file [08:30:38] but it'd probably be best to get confirmation there is a genuine problem first [08:39:18] I think my big worry is that this will become a problem not today or tomorrow but in a week or two. And I'm off next week. [08:39:39] well if you want we can remove the second PTR [08:39:56] I'm just reluctant to do something without any evidence there is an issue [08:40:00] So what I'm going to do is do a brain dump into a ticket along with the potential solutions. I'm not sure how we might go about finding out if it will be a problem or not. [08:40:21] we can consult people in the know and/or look at standards docs [08:40:25] The easy option is to just allocate new service IPs and have them separate. But it's not ideal for a number of reasons. [08:40:46] that is neither easy nor a practical way to resolve this [08:40:51] if two PTRs are an issue we'll remove one of them [08:41:16] I can help with that no probs if we need ot [08:43:06] The issue is that if we only have one PTR, we need to pick which one and neither on their own are correct :D [08:43:43] that remains the case if you use another IP [08:43:58] it would still live on lists1004, but not refer to that host [08:49:34] it would also be a snowflake, the only host without a PTR for its hostname, not managed by our automation, etc [08:50:55] I guess the question is if that's worse than using another public IP, so that it can have dns not managed by our automation? 
[08:51:25] In the old host, we had two IPs, one with forward/reverse records for lists1004, one for lists.wm.o. Exim was configured to send email from mailman using the lists IP, so that would have an A/MX record and a PTR that matched. Everything else went over the lists1004 IP, and its forward/reverse records matched. [08:51:25] The issue I'm considering is that we might be sending mail and a receiving MTA will do a reverse lookup on the IP that it arrives from, and sometimes get lists.wm.o and sometimes get lists1004.wm.o. [08:51:25] We could remove the lists1004.wm.o PTR record, which would probably raise fewer flags on a receiving MTA, but I'm not sure if/what that would break on our side. Like you say it might not be a problem, I don't really know enough about mail to say for sure. [08:51:29] Hope that makes sense? [08:52:06] Not really, the SPF records for the domain say the only valid senders are mx1001 and mx2001, so I think you're gonna have major issues trying to send mail for the domain from lists.wikimedia.org directly [08:54:29] what's there at the moment in terms of forward/reverse looks perfectly valid to me in terms of standards and best practice - based on a quick search - but there is some possibility that some misconfigured / poorly coded mail server out there will not like the two PTRs [08:54:38] I never touched those and I'll be honest I don't understand how it hasn't broken. I think it's because the spf on wm.o allows our IPs. [08:57:48] Anyway, we'll leave it as is for the moment. I'll write this all up in a ticket in case it somehow causes a problem while I'm away and we can look at it when I'm back. 
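[editor's note] The concern above is about what a receiving MTA sees when it does a reverse lookup on the sending IP and then checks the forward record for the returned name (often called forward-confirmed reverse DNS). A minimal sketch of that check follows, assuming the records are supplied as plain dictionaries instead of being queried live. The IP `192.0.2.10`, the `example.org` names, and the function names are hypothetical and chosen only for illustration; they are not the real records discussed in the channel.

```python
from typing import Dict, List, Set


def confirmed_names(
    ip: str,
    ptr_records: Dict[str, List[str]],
    forward_records: Dict[str, Set[str]],
) -> List[str]:
    """Return the PTR names for `ip` whose forward (A/AAAA) records include `ip`."""
    names = ptr_records.get(ip, [])
    return sorted(n for n in names if ip in forward_records.get(n, set()))


def ptr_is_ambiguous(ip: str, ptr_records: Dict[str, List[str]]) -> bool:
    """True when a reverse lookup on `ip` can return more than one distinct name."""
    return len(set(ptr_records.get(ip, []))) > 1


if __name__ == "__main__":
    # Hypothetical data: one IP carrying two PTRs, both with matching forward records.
    ptr = {"192.0.2.10": ["lists1004.example.org", "lists.example.org"]}
    fwd = {
        "lists1004.example.org": {"192.0.2.10"},
        "lists.example.org": {"192.0.2.10"},
    }
    print(confirmed_names("192.0.2.10", ptr, fwd))
    print(ptr_is_ambiguous("192.0.2.10", ptr))
```

In this model both names pass the forward check, so a receiver that accepts any confirmed name would be satisfied. The ambiguity only matters to a receiver that expects the reverse lookup to return exactly one name, or one specific name, which is the open question in the discussion.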
[08:58:05] eoghan: thanks yep put me as a subscriber to the ticket [08:58:47] my thinking on it is it's no point doing something bespoke now, to protect against a potential problem that we have no real evidence will be an issue (and isn't according to the standards, so we're protecting against someone with misconfigured rules or poorly coded software) [08:59:11] if the problem comes up when you're away we can remove the second PTR to resolve short-term and then review what is best [09:01:00] another alternative - were that the case - is to just put lists1004.wikimedia.org in the MX record for lists.wikimedia.org, at which point we only need the Netbox-generated A/AAAA records and all matches [09:02:55] Ah, I tried that -- we need the A/AAAA for the web UI and then our DNS checker complains that we have lists.wm.o without a PTR and we end up back in the same position. If we put the CDN in front of the lists UI then we solve that problem, which might be the best option in the medium term. [09:03:16] to provider some anecdotal evidence: when mx1001/mx2001 were migrated to bullseye some years ago we chose a rather complex method to reimage them in place (we didn't have the VM-capable reimage cookbook back then) since there was concern about IP reputation issues [09:03:31] but since then outgoing mail has partly switched to new IPS [09:03:53] via the new mx-out hosts that Jesse set up and thre doesn't seem to be an issue with that [09:05:23] so likely in the wide scheme of things aspects like SPF/DKIM replaced reliance on IP addresses for that kind of assessment [09:06:12] eoghan: again should all be solvable I think, we can review on task [09:06:46] Hi! 
I have questions about the installservers, specifically, where a specific file on them comes from (/srv/tftpboot/lpxelinux.0) [09:08:54] The reason for this is that I am trying to test a working hypothesis that the trouble with some Broadcom firmware versions failing to chainload ldlinux.c32 is partially caused by that file (there is a new version in bullseye that may have a patch that fixes stuff). But I have no idea where the existing file comes from, hence my question. [09:08:56] topranks: Yeah, I'll stick that in as an option. [09:10:43] klausman: is gets served from volatile [09:10:54] which is kinda NFS-for-Puppet [09:11:17] for Puppet 7 enabled hosts is gets pulled from /srv/puppet_fileserver/volatile/tftpboot [09:11:22] e.g. on puppetserver1001 [09:11:40] the sub directories in there [09:11:45] like buster-installed [09:11:46] like buster-installer [09:11:56] are from Debian releases [09:12:10] we take the official images and glue the firmware CPIO on top it [09:12:11] I mostly care about /srv/puppet_fileserver/volatile/tftpboot/lpxelinux.0 [09:12:51] that file has simply been there forever, I was once added in the dark ages and I don't think anyone ever updated it [09:12:58] I see. [09:13:02] so if there ws a pxelinux release, we can try updatng it [09:13:52] or stop puppet on the install host for the DC [09:13:58] and manually replace it there for a test [09:14:08] next puppet run should revert it back to the version from volatile [09:14:23] The Debian Changelogs have the last entry in 2020, but since we don't know from whence our file comes. [09:14:46] I may do that (stop puppet, replace file, testboot) this afternoon [09:15:09] If so, I'll do the proper SAL things, of course [09:16:25] sounds good! [09:16:56] if that fixes it, I might also add a text file next to it, explaining whence the binary comes. 
just for next time ;) [09:18:28] I ran strings on the version we have in volatile and on what is currently in latest Debian: [09:19:23] the current version identifies as 6.0.3 20150819 and the latest one in Debian as 6.0.4 20200816 [09:19:35] so this looks very promising! [09:19:50] ah, strings, I should've thought of that [09:19:52] current version == current version in volatile [09:20:44] I checked the source (as prepared by dpkg-build) and it does contain the patch I mentione in #wikimedia-sre [09:23:58] nice! now if only we also had source for all the others of the Dell/Supermicro boot stack :-) [09:25:53] One blob at a time :) [10:02:22] topranks: https://phabricator.wikimedia.org/T367959 is the writeup here. Please let me know if you think I've missed something or anything's unclear. [10:45:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:53] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9906632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1002 for host netbox-dev2003.codfw.wmnet with OS bookworm executed with errors: - netbox-de... [12:01:38] 10CAS-SSO, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9906713 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d9d9df4b-e647-4f8e-8b55-811d9f86d7d0) set by slyngshede@cumin1002 for 5 days, 0:00:00 on 1 host(s) an... [12:35:24] Well, that wasn't it. Even with the newer lpxelinux.0, the symptoms and failure are the same. I've restored the file to its old glory [12:37:10] One thing of note: I ran tshark whil testing and the host never requests anything but the first file. I.e. 
it never even tries to get ldlinux.c32, as far as I can tell. This still looks like a bug in the firmware or the lpxelinux.0 file, but I got no clue how to further debug this, except maybe hacking the C source and adding more useful output of what's going on. [12:39:39] klausman: please add your experimentation/findings on https://phabricator.wikimedia.org/T304483 [12:39:45] Will do [12:41:57] and thanks for looking into it, we really need to get it fixed, sticking with an old firmware is problematic [12:43:25] klausman: and yeah that seems to confirm what I saw in https://phabricator.wikimedia.org/T303776#7781198 (download the first file but nothing after) [12:44:13] Which aggravatingly is the same symptom as the bug I mentioned in #wmf-sre [12:44:58] we should still move to a more recent pxelinux independent of the current debugging, I've filed https://phabricator.wikimedia.org/T367970 [12:45:43] For extra brownie points, make puppet install pxelinux and alert on its file and the one in /srv/ being different [12:45:57] klausman: and from https://phabricator.wikimedia.org/T303776#7797564 might have some useful debugging, if you want to dig deeper http://marcoguerri.github.io/2016/03/20/pxeboot-failures-chelsio.html [12:46:39] I think that moves from fetching things via HTTP to using TFTP, right? [12:46:59] which, again, fits with that other bug :-/ [12:48:26] the github.io page is of course 404 [12:48:41] ah, no, c&p skill issue [12:49:24] the link on the task is 404 but I found the new one [12:49:30] ack [12:49:48] (I meant skill issue on my side, not yours ;)) [13:00:48] Ahem, I chose to be a bit cavalier and tried that fix in the blogpost and ... 
well it works for this SMC machine [13:01:33] klausman: amazing [13:01:39] So it is indeed the mangling of interrupts by lpxelinux.0 and/or the NIC firmware that is causing this [13:02:46] But now the load of the initrd.gz hangs (or is very slow), which just indicates that the whole thing is messy [13:03:29] well, it's still a step forward [13:40:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:48] XioNoX: I am advertising 10.3.0.5/32 from dns6001 and I can also see the session established on asw1-b12-drmrs, but I can't actually reach the address and I am wondering if there is an intermediate step on Netbox for the anycast address? [13:53:58] sukhe@asw1-b12-drmrs> show route 10.3.0.0/24 [13:54:06] shows me that something is missing for 10.3.0.5/32 [13:55:08] I thought I would run Capirca to see if something else needs to be added but it's taking forever :> [13:56:05] everything looks good on the host itself and the switch confirms that too so nothing seems to be missing there in the BGP advertisement itself [13:56:37] bast6003:~$ ping 10.3.0.5 works [13:56:50] I have to step away to an appointment, I can have a look in 1h [13:57:04] from non-drmrs it doesn't work though [13:57:05] np! 
[13:57:36] when you are back, look at: [13:57:37] sukhe@asw1-b12-drmrs> show route 10.3.0.4/32 [13:58:02] vs 10.3.0.5/32 (ntp-a) [14:02:09] actually a better comparison would be 10.3.0.2/32 (ntp, old, working) vs 10.3.0.5/32 (ntp-a, new, not working) [14:05:51] also one more data point: Capirca is timing out [14:06:08] https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ this [14:38:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:30] sukhe: I'm back! [14:46:42] XioNoX: in a meeting, brb in 10 [14:47:20] yes, capirca is not doing well, it needs multiple run to find one that doesn't timeout [14:49:35] no pb, when you're back, has it been deployed on dns6001 only so far? if so the `show route` output is normal [14:49:46] yep [14:49:48] only dns6001 [14:49:52] puppet disabled elsewhere [14:50:50] 10netops, 06Infrastructure-Foundations, 06SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907245 (10Papaul) Yes i don't think this approach will work for codfw, like @cmooney said: "codfw dc-ops match the switch port... [14:52:24] sukhe: oh I know [14:53:13] sukhe: it was decided to not advertise anycast prefixes from POPs to core [14:54:01] but keep them local and advertise them from core to POPs as last hope for resiliency [14:54:16] and central services like syslog [15:00:51] aaaaa [15:00:55] yes! that makes total sense [15:00:56] thanks! [15:01:11] thanks, rolling it out elsewhere then :) [15:01:23] sukhe: re-reading the inital task, I don't think that conflicts with the plan [15:01:25] yeah [15:01:26] cool [15:01:48] try one in codfw to be sure [15:01:57] yep, I will try in more edge and core [15:15:07] XioNoX: looking good! 
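[editor's note] The explanation above (anycast prefixes advertised at a POP stay local to that POP, while prefixes advertised from core are also sent to the POPs) accounts for why the ping worked from bast6003 in drmrs but not from elsewhere while only dns6001 was deployed. A toy model of that policy, assuming a simple set-based view of sites: the function name and the `CORE_SITES` contents are hypothetical, treating codfw as a core site is an assumption made for the example, and real route selection on the switches is of course more involved than this.

```python
from typing import Set

# Assumption for illustration only: which sites count as "core".
CORE_SITES: Set[str] = {"codfw"}


def can_reach_anycast(source_site: str, advertising_sites: Set[str]) -> bool:
    """Toy model of the policy described in the chat.

    - A prefix advertised inside a site is visible within that site.
    - A prefix advertised from a POP is NOT propagated to core or other POPs.
    - A prefix advertised from a core site is propagated to the POPs.
    """
    if source_site in advertising_sites:
        return True
    # Only core-originated advertisements leave their own site.
    return any(site in CORE_SITES for site in advertising_sites)


if __name__ == "__main__":
    only_drmrs = {"drmrs"}  # state while the service ran on dns6001 only
    print(can_reach_anycast("drmrs", only_drmrs))
    print(can_reach_anycast("codfw", only_drmrs))
    print(can_reach_anycast("drmrs", {"codfw"}))
```

Under this model the observed behaviour is expected, and reachability from other sites appears once hosts there (or in core) start advertising the address, which matches the decision to continue the rollout.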
[15:15:13] rolling it out everywhere else [15:15:24] awesome! [16:16:04] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9907487 (10ops-monitoring-bot) Deployed netbox-dev to netbox-dev2003.codfw.wmnet with reason: Netbox 4 on netbox-dev2003 - ayounsi@cumin1002 - T336275 [16:34:05] 10netops, 06Infrastructure-Foundations, 06SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907538 (10cmooney) 05Open→03Resolved a:03cmooney Cool thanks @papaul. I guess we can see how we get on over the nex... [16:43:16] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9907553 (10ayounsi) Some notes before I forget, to make the `sre.deploy.python-code` work I had to: * On deploy1002: run `sudo chown -R mwdeploy /srv/deployment/netbox-dev/deploy/.gi... [16:55:48] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:54] 10netops, 06Infrastructure-Foundations, 06SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9907616 (10Papaul) I don't think we have a lot of servers right that have 10G NIC put using the 1G NIC. Most of the servers... 
[17:53:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:01] Interesting feature [21:45:04] https://community.juniper.net/blogs/moshiko-nayman/2024/06/19/junos-symmetrical-load-balancing [21:45:38] Unlikely top-of-racks will arrive with that chipset / ability however [21:55:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed