[01:03:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:55:49] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:10:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:48:57] (SystemdUnitFailed) firing: (2) update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:37] (SystemdUnitFailed) firing: (2) update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:50] (PuppetZeroResources) firing: Puppet has failed generate resources on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:43:50] (PuppetZeroResources) resolved: Puppet has failed generate resources on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:49:49] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:04:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:48:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:37:51] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 5 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9566904 (Marostegui) What is the idea? Will codfw remain depooled for a week or two? For DBAs this would be good so we can perform s...
[10:43:06] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9566921 (Marostegui)
[10:50:05] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9566974 (Marostegui)
[12:47:47] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9567401 (Marostegui) I'd love if it can be a bit longer than 7 days as we can do lots of operational maintenance and save a bunch of...
[13:01:53] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567450 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1034.eqiad.wmnet with OS...
[13:42:25] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567605 (aborrero)
[13:47:12] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9567631 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1034.eqiad.wmnet with OS book...
[13:48:37] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:39] netops, Infrastructure-Foundations, SRE: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9567655 (cmooney) p:Triage→Low
[13:55:15] netops, Infrastructure-Foundations, SRE: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9567676 (cmooney)
[13:55:23] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9567677 (cmooney)
[15:05:40] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9568015 (jijiki)
[15:37:35] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9568184 (cmooney)
[15:38:41] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9568181 (cmooney) Open→Resolved a:cmooney Closing this, thanks all for the help!
[15:42:42] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9568203 (jijiki)
[15:43:30] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 6 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9543213 (jijiki)
[15:48:44] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568216 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93a3c441-2097-4840-a202-5694f260c1b5...
[15:55:45] anyone working with sretest1001? I need to do some tests with the firmware upgrade cookbook and I might need to reboot the host
[15:56:13] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568300 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90864fe1-6d91-45db-a2a5-2bb22463c114...
[15:59:33] volans: not I
[16:03:31] * volans using the silence as approval :D
[16:08:17] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568400 (cmooney) All hosts moved successfully and back responding to pings.
[16:13:02] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568428 (MatthewVernon) Swift is back OK, thanks.
[16:32:56] netops, Infrastructure-Foundations, SRE, ops-codfw: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568564 (cmooney) p:Triage→Medium
[16:33:38] netops, Infrastructure-Foundations, SRE, ops-codfw: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568586 (cmooney)
[16:33:47] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9568588 (Fabfur) cp2031 and cp2032 are ok and repooled
[16:39:09] netops, Infrastructure-Foundations, SRE, ops-codfw: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568622 (cmooney) All interfaces on asw-a-codfw are set to 'disabled' apart from the uplinks to ssw's, and no mac's learnt on SSW side so proceeding to delete those links...
[17:02:01] netops, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568765 (cmooney) Ok I've removed the configuration for the ESI-LAG between the codfw spine switches and asw-a-codfw both sides now. DC-Ops you can...
[17:02:52] netops, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9568799 (cmooney)
[17:48:38] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:14:30] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#9569312 (Volans) I've tested that the cookbook works fine with the existing cached firmwares on the cumin...
[18:19:52] netops, Infrastructure-Foundations, SRE: Do we need to generate aggregates for LVS service IP ranges? - https://phabricator.wikimedia.org/T350354#9569320 (cmooney) Open→Resolved a:cmooney >>! In T350354#9312533, @BBlack wrote: > I don't suspect it serves any real purpose at present, unles...
[21:48:38] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed