[01:53:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:11] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10295141 (Papaul)
[06:46:27] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10295142 (Papaul) There will be some maintenance in magru sometime next week and the site will be de-pool we can take advantage of this maintenance window to upgrade the router th...
[08:12:22] netops, SRE-tools, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485#10295209 (Volans) >>! In T336485#10294334, @cmooney wrote: > I don't see that forced in /etc/ssh/ssh_config though. Also w...
[08:18:27] hey folks!
[08:18:59] as fyi the AUX cluster is now running containerd, so in case you want to inspect containers/images on the node you'll need to use `nerdctl`
[08:19:16] it is very similar to the docker command
[08:19:30] so now we can rightfully say we're nerd? :-P
[08:20:21] I think the train that said we were not left the station years ago :D
[08:35:03] now we can control the nerds though
[08:35:04] \o/
[08:35:42] Don't tell my wife that nerdctl exists
[08:35:54] haha
[08:44:12] lol
[09:43:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:27] ^ leftover of removing memcached support from the IDPs, I've just cleaned it up
[09:53:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:13] Mail: Create user preference to receive change notification emails for bot edits - https://phabricator.wikimedia.org/T358087#10295847 (Urbanecm_WMF)
[13:53:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:58:25] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:19] netops, Infrastructure-Foundations: Testing liberica with ncredir@eqiad - https://phabricator.wikimedia.org/T378453#10296381 (Vgutierrez) Open→Resolved a:Vgutierrez Thx @ayounsi & @cmooney. lvs1013 running liberica is now the primary load balancer for ncredir@eqiad
[14:50:36] netops, Infrastructure-Foundations, Traffic: BGP settings for liberica - https://phabricator.wikimedia.org/T379164 (Vgutierrez) NEW
[14:50:36] netops, Infrastructure-Foundations, Traffic: BGP settings for liberica - https://phabricator.wikimedia.org/T379164#10296452 (Vgutierrez) p:Triage→Medium
[15:17:15] netops, Infrastructure-Foundations, Traffic: BGP settings for liberica - https://phabricator.wikimedia.org/T379164#10296519 (cmooney) I personally don't think the current config is a bad thing to have in general (we have a lower pref/normal pref/higher pref community defined). None of the community...
[15:23:45] elukey: how goes the efi testing?
[15:27:14] jhathaway: o/ good! There were two little things to fix in the reimage cookbook but the rest seems working, there is a puppet-issue (unrelated to efi) for ms-be to sort out yet
[15:27:21] but the recipe seems to work etc..
[15:27:54] thanks, I saw the couple of fixes roll by, thanks for making those
[15:35:50] the rest worked nicely
[15:36:19] only one time I noticed "media not present" when booting after d-i, ending up in a second PXE boot and d-i install
[15:36:42] no idea why, but there was the issue of not properly recognizing if d-i was running (by the cookbook)
[15:44:07] yeah, that is strange
[15:44:32] there is also the operator variable to keep in mind :D
[15:45:26] :D
[16:44:21] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296813 (Papaul)
[16:45:42] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296815 (Papaul)
[16:45:58] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296817 (Papaul)
[17:07:26] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296916 (ssingh) `cr1-eqiad` is stated for Nov 13 but note that T376737 is also scheduled for that period (Nov 13, 8 CT) and it might make tricky for both `magru` and `eqiad` to...
[17:12:45] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296934 (Papaul)
[17:13:40] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296938 (Papaul) @ssingh thanks i forgot about the 13th I update the dates.
[17:13:53] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296937 (Joe) I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're doing a datacenter switchover? Because oth...
[17:18:56] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296958 (akosiaris) > Upgrades should follow the standard process The standard process docs are outdated I fear. > Depool site (optional) > (optional) if codfw, drain mw traff...
[17:18:57] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296959 (Papaul) >>! In T364092#10296937, @Joe wrote: > I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're do...
[17:22:02] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296980 (Papaul) Thanks @akosiaris @Joe we can hold back on codfw for now and work on eqiad. when we switch back to eqiad we can schedule the upgrade for codfw.
[17:22:34] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296985 (Papaul)
[17:28:03] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10297017 (akosiaris) >>! In T364092#10296980, @Papaul wrote: > Thanks @akosiaris @Joe we can hold back on codfw for now and work on eqiad. when we switch back to eqiad we can sche...
[17:30:46] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10297026 (cmooney) >>! In T364092#10296958, @akosiaris wrote: > codfw will be the primary during that set of dates, it should NOT be depooled. Agreed. It should also be possible...
[17:33:18] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297053 (cmooney) @Jclark-ctr could you also let me know what ports on the fmsw these two were plugged i...
[18:28:38] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297301 (cmooney) >>! In T377381#10250655, @Jgreen wrote: > There are 6 servers being replaced: > {T3695...
[19:21:06] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297478 (cmooney) All, just to be aware I hit another snag this evening which may be problematic. When...
[20:09:07] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297583 (Dwisehaupt) > Thanks @Jgreen . Looking at the existing ports on the switch I think it might ma...
[20:22:19] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297595 (Jclark-ctr) @cmooney replaced 1g dac cables with sfpt and cat6 cables. These two switches ha...
[20:54:45] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297675 (cmooney) >>! In T377381#10297595, @Jclark-ctr wrote: > These two switches have been removed fro...
[21:45:25] FIRING: SystemdUnitFailed: uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:25] FIRING: [7x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:55:25] FIRING: [9x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:05:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:10:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:25] RESOLVED: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed