[03:18:33] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:33] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:05] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Volans) @cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs, in particular the makevm cookbook has a knowledge of DCs that hav...
[08:10:04] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (Volans) For the record as Giuseppe is out, I had a chat with @CDanis going over the plan and numbers and we didn't find anything worrisome or...
[09:26:48] netops, Infrastructure-Foundations, SRE, cloud-services-team: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) This is the patch to enable the single NIC setup on ceph nodes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856675/ Is marked as abando...
[09:28:49] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro) Unfortunately, it seems that the cluster has grown in the last few days :/, as draining the last 21 osd daemons would...
[09:52:55] it looks like codfw ganeti cluster B has some issues. At least some vms are offline (releases2003, kafkamon2003 and mx2001).
[09:54:22] (SystemdUnitFailed) firing: netbox_ganeti_codfw_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:32] yeah, there's some DRBD errors in the kernel logs, I'm currently rebooting it
[10:08:33] (SystemdUnitFailed) resolved: netbox_ganeti_codfw_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:42] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:22] (SystemdUnitFailed) resolved: (2) ifup@ens13.service Failed on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:39] great thanks a lot! I can reach the vm again
[12:39:02] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney) p:Triage→Low
[12:51:17] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney)
[12:51:57] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney)
[14:17:51] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (cmooney) @Jclark-ctr let me know when you have time to look at this now that the optics have been received. thanks :)
[15:59:33] CAS-SSO, Infrastructure-Foundations, Release-Engineering-Team, collaboration-services, and 2 others: Add GitLab to offboarding workflow - https://phabricator.wikimedia.org/T339843 (LSobanski) LDAP sync is now implemented but some manual permissions remain in place so this is still a valid request.
[16:27:19] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (lmata)