[01:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:46] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:20:48] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:21] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9902257 (ABran-WMF)
[08:12:52] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9902282 (ABran-WMF)
[08:25:48] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:47] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:22] moritzm, slyngs when you have the chance I'd like your input on https://phabricator.wikimedia.org/T367861, specifically how and when we could depool ldap-ro && ldap-ro-ssl per DC to be able to migrate those two services to IPIP encapsulation
[09:35:01] I could provide the puppet changes needed cost free.. courtesy of Traffic :P
[09:37:18] sure, I'll have a look later in the day
[09:37:39] thx <3
[09:40:32] vgutierrez: not even a coffee as payment?
[09:41:06] moritzm only have club mate anyway
[09:41:07] coffee and beer are always welcome of course :D
[10:56:12] Hello, I'm migrating mailman from one host to another, and I've moved the service IPs from the old host to the new but they don't seem to be reachable on the new host. Could someone help me try and figure out where I'm going wrong, please?
[10:56:37] Moved 208.80.154.21/32 and 2620:0:861:1:208:80:154:21/128 from lists1001 to lists1004
[10:58:18] eoghan: that IP is in the row A specific public1-a-eqiad VLAN, but lists1004 is in row C
[10:58:33] Aha.
[10:58:47] Is it possible to move it?
[10:59:25] the server or the service addresses?
[11:00:06] The addresses from one vlan to another
[11:01:09] Guessing not really. Hm.
[11:05:57] Can we get a new service IP for that vlan and retire the old lists.wm.o IPs in the row C vlan?
[11:13:10] topranks: maybe a netops question ^
[11:13:24] * topranks looking
[11:15:26] eoghan: yeah so without knowing too much about how the IP was set up on lists1001
[11:15:40] it's a VM on our ganeti cluster in row A in eqiad
[11:16:09] Yeah, it was a secondary IP on the VM. We're moving to a physical host (lists1004)
[11:16:17] which for a "normal" public IP would be put on vlan public1-a-eqiad, which has subnet 208.80.154.0/26
[11:16:22] yeah
[11:16:38] so the other way to do a service IP or VIP that moves around is to use BGP on the host
[11:17:00] establishing a session to the top-of-rack or core router and announcing the IP as reachable via it
[11:17:18] that way decouples the IP from the vlan the host is on... but is more complex
[11:17:34] in this case I expect as you say the easiest might be to keep the current setup, but assign a new service IP?
[11:17:45] or move the physical host :D
[11:17:55] New IP seems the simplest all round.
[11:17:57] haha yes or move the physical host :P
[11:18:01] caution
[11:18:05] Or maybe a really long cable?
[11:18:09] the IP is probably listed in other places?
[11:18:19] ooh now we are getting creative :P
[11:18:38] volans: It's in puppet and dns, not sure if there's anywhere else we'll need to update. It doesn't go through the CDN.
[11:18:39] like what happens to the reputation?
[11:18:54] IP reputation, dkim and all the other mail-related stuff?
[11:18:55] so this VIP is in the dns repo manually it seems
[11:18:56] https://netbox.wikimedia.org/ipam/ip-addresses/6658/
[11:19:14] which I guess leaves the actual DNS record change outside of anything we do in Netbox
[11:19:37] but yes, there is the dns stuff, plus all that lovely mail spf/dkim or whatever things that I don't really know about
[11:20:07] is that IP used also as MX or source IP for outgoing email?
[11:20:11] volans: I think we should be ok. I think that mail goes outbound from mx*.wm.o, so reputation should be based on that? I think.
[11:20:18] eoghan: for now I'll go ahead and assign a new public IP from public1-c-eqiad for the new host to replace this?
[11:20:29] topranks: Yeah, let's go with that for the moment please.
[11:20:50] ok
[11:22:11] The other option is to just use the existing public IP on the host. I don't know why it has a service IP
[11:22:49] no, outbound mail from lists does not use mx*.wm.o
[11:23:01] eoghan: that's probably simpler in a few ways alright
[11:23:14] but I've no insight into why a separate IP may have been used
[11:23:34] if it was for "portability" using a secondary on the vlan doesn't help much :(
[11:24:01] taavi: Oh sorry, you're right
[11:24:23] In that case why is the spf record set to allow mx* and softfail everything else.
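Editor's note: the reachability problem diagnosed above (10:58 to 11:16) comes down to subnet membership. The sketch below uses Python's standard `ipaddress` module with the service address and the row A subnet quoted in the conversation. The helper name `fits_vlan` is made up for illustration, and the row C subnet is not stated in the log, so it is not checked here.

```python
import ipaddress

# Values quoted in the conversation above.
SERVICE_IP = ipaddress.ip_address("208.80.154.21")
ROW_A_SUBNET = ipaddress.ip_network("208.80.154.0/26")  # public1-a-eqiad


def fits_vlan(address, subnet):
    """Return True if the address falls inside the subnet routed on a VLAN.

    Hypothetical helper: an address outside the subnet of the VLAN a host
    is attached to will not be reachable there without extra routing
    (for example the BGP announcement mentioned in the log).
    """
    return address in subnet


if __name__ == "__main__":
    # The service IP belongs to the row A subnet, so it only works as a
    # plain secondary address on hosts attached to that VLAN.
    print(fits_vlan(SERVICE_IP, ROW_A_SUBNET))
```

Running the same check against the subnet of the VLAN that lists1004 sits on would show whether a given address can simply be added to that host as a secondary.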
[11:24:40] As of now I've assigned new IPs manually:
[11:24:40] https://netbox.wikimedia.org/ipam/ip-addresses/17097/
[11:24:45] https://netbox.wikimedia.org/ipam/ip-addresses/17098/
[11:26:15] topranks: Ok. We might just use the host IP, if we do that we'll let you know and you can free them up. Will know in a few. Currently debating this in #wikimedia-sre-collab
[11:26:23] ok thanks
[11:26:36] yeah if we can avoid using another set of IPs, and having this manual piece of work, it's best
[11:26:50] I'll leave them there for now - if you use the host IPs let me know I'll remove the allocation
[11:30:39] We're going with the host IPs. Hold off on allocating for a while though, just in case we come across a blocker. Unlikely, but just in case
[11:41:22] eoghan: ok great thanks
[11:52:12] eoghan: is that temporary or will it be the long term normal ?
[11:52:42] Using the host IP? I hope normal, because I believe the plan is to send outbound out through a different mx host eventually.
[11:52:50] jhathaway might confirm that
[11:52:56] awesome
[11:53:04] yeah that would be ideal
[11:53:18] my message from earlier was seemingly lost in a netsplit: my best guess is that the secondary address was to keep the same IP for mail deliverability reasons, either to keep a stable address for IP reputation or to ensure the reverse DNS matches the lists.wikimedia.org service name
[11:53:29] using a VIP from a "real" subnet is the least preferred option for all the issues you saw here :)
[11:54:32] taavi: Yeah, it's not ideal to move. But keeping in that subnet won't work, so I think moving to the host IP for now and trying to get separate mx moving is the best option personally.
[12:14:20] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9903045 (ABran-WMF)
[12:16:19] eoghan: be aware that host IP will have a TTL of 1H as opposed to the 5M of the service IP
[12:16:55] Yeah, see that now.
[12:17:04] at least for anything using the hostname ofc, for dedicated record that can be different
[13:14:59] topranks: You can release those new IPs. We'll keep the old service IPs until we've decommed lists1001 which will be in ~2 weeks.
[13:15:33] eoghan: ok great, if you can remind me when lists1001 is gone I'll make sure the old ones are cleaned up too
[13:15:35] thanks
[13:22:45] Will do
[13:25:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:00] eoghan: happy to help out on the email architecture piece, so far in postfix mail work we are using new ips and have not seen any effect on spam reputation
[13:52:59] Puppet, Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903423 (CDanis) I think the last step to do here is to validate that any rsync failures will get reported on IRC...
[14:22:39] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903563 (jcrespo)
[14:24:14] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903572 (jcrespo)
[14:47:49] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903652 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0039bfdd-84ad-4638-9b4c-c0c23984e401) se...
[14:56:53] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903691 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b16e0477-5d40-4e59-950e-09e82271c822) se...
[14:57:44] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903694 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80e189d2-8757-4138-ad14-1e0cf5cfbbdb) se...
[15:18:40] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903792 (cmooney) Switch is back online after upgrade, everything looks good at first glance.
[15:24:08] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903811 (MatthewVernon) ms swift looks good, thanks.
[15:47:08] Puppet, Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903900 (Dzahn) How about adding a MAILTO to the timer and mail a specific list / team / group? I think that ale...
[17:25:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:10:14] jhathaway: So, interested to hear your opinion. We decided against using the service IP earlier. But now we're in a weird position where we're not sure the best PTR to have. Right now we have lists.wm.o and lists1004.wm.o, which isn't right. But I'm not sure whether selecting just one of those is correct either.
[22:10:14] The most correct solution would be going back to service IPs which is doable, but would just having the reverse DNS point to lists1004.wikimedia.org be damaging?
[22:13:33] hmm
[22:15:42] I would think you would want the mx record for lists.wikimedia.org to be lists1004.wikimedia.org rather than lists.wikimedia.org
[22:15:54] it is necessary for the mx record to be lists.wikimedia.org
[22:15:57] ?
[22:20:18] eoghan: ^
[22:21:30] I don't think it's necessary if we're not using the service IP, no. You might be right, that does seem like the cleanest option that ticks all boxes.
[22:21:42] And then CNAME lists.wikimedia.org to lists1004 for the web UI
[22:41:06] jhathaway: So something like this: https://gerrit.wikimedia.org/r/c/operations/dns/+/1047192
[22:42:19] eoghan: right, I think that would work
[22:42:46] Yeah, I think I agree, this is the simplest solution.
[22:43:01] I'll get some more eyes on it and deploy tomorrow, it's late here.
[22:43:27] sounds good
[22:44:20] Although the checker is complaining about `23:41:55 error: Name 'lists.wikimedia.org.': CNAME not allowed alongside other data`. Does this mean you can't have a CNAME and an MX on the same record?
[22:45:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:51:36] jhathaway: Updated there to have an A/AAAA for lists.wm.o pointing to the same IPs as lists1004, but the mx doesn't point to lists.wm.o, it points to lists1004.
[22:54:11] Hm, actually no, it still complains because there's no PTR records for lists.wm.o. So bit of a no win there.
[23:14:27] Service IPs might be the best option ):
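
Editor's note: on the question asked at 22:44:20, the DNS specification (RFC 1034, section 3.6.2) does not allow a name that owns a CNAME to own other record types such as MX, which is what the checker's "CNAME not allowed alongside other data" message reports. The sketch below illustrates that rule only. It is not the checker used by the dns repo, the function name `cname_conflicts` is hypothetical, and the record values (MX preference, addresses) are placeholders, not the contents of the Gerrit change.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return the names that own a CNAME together with any other record type.

    `records` is an iterable of (name, type, value) tuples. DNSSEC record
    types, which are permitted next to a CNAME, are ignored in this sketch.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Placeholder data shaped like the first attempt: CNAME plus MX on one name.
first_attempt = [
    ("lists.wikimedia.org.", "CNAME", "lists1004.wikimedia.org."),
    ("lists.wikimedia.org.", "MX", "10 lists1004.wikimedia.org."),
]

# Placeholder data shaped like the second attempt: A/AAAA plus MX.
# The addresses are documentation ranges, not the real host IPs.
second_attempt = [
    ("lists.wikimedia.org.", "A", "192.0.2.1"),
    ("lists.wikimedia.org.", "AAAA", "2001:db8::1"),
    ("lists.wikimedia.org.", "MX", "10 lists1004.wikimedia.org."),
]

if __name__ == "__main__":
    print(cname_conflicts(first_attempt))
    print(cname_conflicts(second_attempt))
```

The second layout passes this particular rule, which matches the log: the remaining complaint at 22:54:11 is about missing PTR records, a separate check.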