[07:17:58] netops, Infrastructure-Foundations, serviceops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9929335 (ayounsi) Strictly on the network side, there is no blocker one way or the other. I think I miss some context, what's the current low-tr...
[07:23:50] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9929348 (jcrespo) p:Low→Medium I got another error at backup2002 (es5): ` 2024-06-26 17:07:31 [ERROR] - Could not read data...
[07:30:32] FIRING: SystemdUnitFailed: cfssl-ocsprefresh-aux_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:22] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929380 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm
[07:39:29] CAS-SSO, Infrastructure-Foundations: Update CAS to 6.6.15.2 - https://phabricator.wikimedia.org/T368503#9929382 (MoritzMuehlenhoff) p:Triage→High
[07:44:23] CAS-SSO, Infrastructure-Foundations: Update CAS to 6.6.15.2 - https://phabricator.wikimedia.org/T368503#9929398 (MoritzMuehlenhoff) Open→Resolved
[08:29:14] RESOLVED: SystemdUnitFailed: cfssl-ocsprefresh-aux_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:34] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929557 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm executed with errors: -...
[09:16:29] netops, Infrastructure-Foundations, serviceops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9929623 (Vgutierrez) >>! In T368545#9929335, @ayounsi wrote: > I think I miss some context, what's the current low-traffic setup ? Usually servic...
[09:21:22] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929646 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test2004.wikimedia.org with OS bookworm
[09:39:03] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9929683 (ABran-WMF)
[10:50:36] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9929996 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test2004.wikimedia.org with OS bookworm completed: - idp-test200...
[10:50:49] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9930000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm
[11:24:38] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9930092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1004.wikimedia.org with OS bookworm completed: - idp-test100...
[11:47:48] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:52:49] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:31:34] ^^ that's 3941eccfd4
[12:35:33] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:14] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:21:49] slyngs: o/
[13:22:30] qq - I am playing with debmonitor releases, is it possible that we are missing the last release tag?
[13:22:36] v0.4.0
[13:46:14] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930546 (CDanis) Could be convinced otherwise, but I'm generally in favor of the MSS clamping option -- we know it works and the tradeoff...
[14:00:13] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930584 (Joe) I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services? Is there a compell...
[14:03:29] netops, Infrastructure-Foundations, serviceops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9930604 (Joe) It is pretty clear to me that the only way to have fair load balancing with `maglev` is if we do the consistent hashing using the r...
[14:20:47] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930681 (akosiaris) T352956 is related (possibly a duplicate) and I 've mulling over it for a few months now. I think we need to have a l...
[14:34:39] moritzm just saw the patch adding ripgrep. Many thanks!
[14:34:57] yw :-)
[14:35:22] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930740 (Vgutierrez) >>! In T368544#9930584, @Joe wrote: > I'd go ahead and take a step back: why do we need to switch to IPIP encapsulat...
[14:36:58] CRs to make Homer compatible with Netbox 4 are ready and tested... Last step, cookbooks...
[14:37:53] nice!
[14:46:41] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930773 (Joe) >>! In T368544#9930740, @Vgutierrez wrote: >>>! In T368544#9930584, @Joe wrote: >> I'd go ahead and take a step back: why d...
[14:46:57] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930774 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=66810f76-0e2d-43f3-8c96-bbfe4e6a7aee) se...
[14:52:20] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930800 (BBlack) For more context: eventually our Katran-based Liberica balancer will replace pybal/LVS. The Katran one has to use IPIP,...
[14:57:10] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930822 (Vgutierrez) theoretically speaking we could keep low-traffic on liberica/IPVS (instead of liberica/Katran) to be able to get rid...
[14:57:58] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930839 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2863d158-d71c-4317-a811-4dd3cb8e6e72) se...
[14:58:14] oh ripgrep is available? 🎉 thanks
[14:58:49] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930845 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bd008f08-7b85-4b69-ba4e-5d84a9307d79) se...
[15:15:32] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:40] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930949 (cmooney) >>! In T368544#9930584, @Joe wrote: > I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation...
[15:19:53] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930977 (Joe) >>! In T368544#9930822, @Vgutierrez wrote: > theoretically speaking we could keep low-traffic on liberica/IPVS (instead of...
[15:20:49] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9930989 (cmooney) Upgrade completed, all looking good network-wise.
[15:40:53] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9931141 (Eevans) >>! In T365988#9930989, @cmooney wrote: > Upgrade completed, all looking good network-wise. Than...
[15:41:31] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9931142 (Vgutierrez) >>! In T368544#9930977, @Joe wrote: > oh I agree 100% with this. My doubts were specifically for switching to katran...
[16:14:14] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:08] Mail, collaboration-services, Infrastructure-Foundations, vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#9931425 (jhathaway) I was able to capture this traceback: ` Traceback (most recent call last): File "/home/jhathaway/./vrts_aliases", line 162,...
[17:00:16] would love a sanity check on the dns change, if anyone has a moment, https://gerrit.wikimedia.org/r/c/operations/dns/+/1050426
[17:00:59] jhathaway: maybe consider a lower TTL in case that needs an immediate rollback?
[17:02:52] taavi: yeah that is a good thought, I was going to do that originally, but changed my mind
[17:04:20] although I guess you could just drop the inbound firewall rule in case of an emergency and it'd have a similar effect
[17:04:59] hopefully
[17:05:18] but I think lowering is probably prudent
[17:23:45] updated taavi