[02:15:33] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:33] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:37] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui)
[07:13:55] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) Database hosts are depooled - @cmooney confirm if you will downtime them or if I should do it myself
[10:15:33] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:24] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney) >>! In T355549#9487462, @Marostegui wrote: > Database hosts are depooled - @cmooney confirm if you wi...
[10:27:37] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) Great thank you!
[11:18:52] these alerts for os-reports are a genuine issue, they are related to the patches to drop system::role (original work by John that I picked up)
[11:19:11] this needs some fixes to the script, I'll look into these tomorrow in more depth
[11:35:21] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (cmooney) p:Triage→Medium
[11:35:39] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (cmooney)
[11:35:47] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:36:52] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (cmooney) p:Triage→Medium
[11:37:51] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (cmooney) p:Triage→Medium
[11:38:01] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (cmooney)
[11:38:17] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (cmooney)
[11:39:04] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (cmooney) p:Triage→Medium
[11:39:16] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:39:23] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (cmooney)
[11:40:28] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (cmooney)
[11:40:33] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:41:03] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:41:11] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (cmooney)
[11:42:30] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (cmooney) p:Triage→Medium
[11:42:39] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (cmooney)
[11:42:47] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:43:28] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (cmooney) p:Triage→Medium
[11:43:35] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (cmooney)
[11:43:43] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:45:24] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (cmooney) p:Triage→Medium
[11:45:31] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (cmooney)
[11:45:39] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:46:58] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (cmooney) p:Triage→Medium
[11:47:07] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:47:15] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (cmooney)
[11:52:12] netops, Infrastructure-Foundations, SRE: Create netbox script to support moving a cable from one network port to another - https://phabricator.wikimedia.org/T355869 (cmooney) p:Triage→Low
[11:53:14] netops, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (cmooney) p:Triage→Medium
[11:53:20] netops, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (cmooney)
[11:53:28] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:54:31] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (cmooney) p:Triage→Medium
[11:54:40] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (cmooney)
[11:54:48] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:55:29] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (cmooney) p:Triage→Medium
[11:55:40] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:55:48] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (cmooney)
[11:56:31] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (cmooney) p:Triage→Medium
[11:56:42] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[11:56:50] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (cmooney)
[11:57:35] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[12:01:00] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (cmooney) p:Triage→Medium
[12:01:09] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (cmooney)
[12:01:17] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[12:23:03] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[12:26:25] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[12:26:37] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (cmooney)
[14:15:33] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:21:52] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:39:07] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:46:40] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=34ae871a-7149-43dd-8180-02ddd5b8c983) set by...
[15:50:58] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (cmooney) Just an update here, the restriction still exists however I think I know how I went wrong. In order for the irb interface to be "up" the associated vlan ne...
[15:57:38] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e2f0518c-1df7-4528-89a1-5f2b248a7520) set by...
[16:35:55] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney) Migration done! Serious props to @papaul and @Jhancock.wm for the smooth and super-fast execution!...
[16:55:00] volans: quick one about downtime if you're about
[16:55:19] we hit a scenario this evening I hadn't anticipated
[16:55:57] u.random had one of the restbase servers we moved downtimed already - for an extended period/maintenance nothing to do with our network move
[16:56:27] I had downtimed everything in the row before we did our maintenance - with rack location selector as you know
[16:56:56] When we were done I removed the downtime for everything in the row - which obviously undowntimed the restbase box and then alerts fired as some services were disabled and stuff
[16:57:20] so I gotta think of that scenario for future moves
[16:57:28] My question is basically this:
[16:57:35] If a host already has a downtime for 7 days set
[16:57:37] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (klausman) Nice work. On our machine (ml-serve2002), it was but four seconds: `[Thu Jan 25 16:09:14 2024] tg3...
[16:57:46] And I run a command to downtime for 1 hour
[16:57:52] what happens at the end of the hour?
[17:00:24] topranks: sorry was afk, reading backlog
[17:01:32] topranks: in both icinga and alertmanager you can have multiple downtimes/silences that match a given alert and it will fire only if no downtime/silence matches
[17:01:47] ah ok nice
[17:01:48] so yes, if you had let the downtime expire by itself it would have left the 7d one and it would not have fired
[17:01:54] ok cool
[17:01:59] but you'll be blind for that amount of time
[17:02:09] indeed, but still probably safer
[17:02:21] and I've a good idea how long the changes take after today
[17:02:33] unfortunately neither icinga nor AM has a native way to do what you need, but we could use the description/comment to do something like our disable/enable puppet
[17:02:35] 19.345 seconds or so :)
[17:02:39] rotfl
[17:02:46] does the cookbook support fractions of a second?
[17:02:47] :D
[17:02:54] :-P
[17:03:36] Slightly longer downtime is probably not an issue, better than force-removing and causing this
[17:03:40] thanks for the info!
[17:03:54] yeah that's an unfortunate use case
[18:05:16] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) @Jhancock.wm @papaul <3
[18:14:29] XioNoX: I reimaged durum to see if the new bird changes hold up on initial puppetization. all good :)
[18:15:34] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:22] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899 (Peachey88)
[21:27:43] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899 (cmooney) So I had a look at this thinking it would be a simple omission or small bug. But I'm totally lost. If you look at the version now on netbox-n...
[21:34:49] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899 (cmooney) FWIW I also tested it passing an input of a switch I'd never run it before in test mode on. i.e. ran it against something with 'commit' ticked...
[21:59:27] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899 (cmooney) Also might be of relevance. Running earlier on production netbox it executed the code as [[ https://gerrit.wikimedia.org/r/plugins/gitiles/ope...
[22:15:34] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
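
Note on the downtime question from 16:55 to 17:03: the behaviour described there (an alert fires only if no active downtime/silence matches it) can be sketched as a toy model. This is only an illustration, not the Icinga or Alertmanager API and not the downtime cookbook. The `Silence` class, the `would_fire` and `remove_tagged` helpers, the host name and the timestamps are all hypothetical. The last case mimics the idea floated at 17:02:33 of using the description/comment to remove only one's own downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Silence:
    """Toy stand-in for a downtime/silence (hypothetical, not a real API)."""
    host: str
    start: datetime
    end: datetime
    comment: str  # free-text tag identifying who or what set it

    def active(self, now: datetime) -> bool:
        return self.start <= now < self.end


def would_fire(host: str, silences: list, now: datetime) -> bool:
    """An alert for `host` fires only if no active silence matches it."""
    return not any(s.host == host and s.active(now) for s in silences)


def remove_tagged(silences: list, comment: str) -> list:
    """Drop only the silences carrying the given comment; leave the rest."""
    return [s for s in silences if s.comment != comment]


# Illustrative times and host name, not taken from the log.
t0 = datetime(2024, 1, 25, 15, 0)
host = "restbase-example"
long_maintenance = Silence(host, t0 - timedelta(days=1), t0 + timedelta(days=6),
                           "unrelated maintenance")
rack_move = Silence(host, t0, t0 + timedelta(hours=1), "rack move")
silences = [long_maintenance, rack_move]

after_move = t0 + timedelta(hours=2)

# Case 1: let the 1-hour downtime expire; the 7-day one still matches.
fires_after_expiry = would_fire(host, silences, after_move)

# Case 2: remove every downtime for the host; nothing matches any more.
fires_after_blanket_removal = would_fire(host, [], after_move)

# Case 3: remove only the downtimes tagged with our own comment.
fires_after_tagged_removal = would_fire(
    host, remove_tagged(silences, "rack move"), after_move)

print(fires_after_expiry, fires_after_blanket_removal, fires_after_tagged_removal)
```

In this model only the blanket removal lets the alert through, which matches what happened to the restbase host. Letting the short downtime expire or removing by tag both leave the longer downtime in place.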