[00:11:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:25] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:25] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:18] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10159245 (Dwisehaupt) T375142 - Expanded the pfw and iptables ranges for prometheus collection so that we can hi...
[00:31:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:25] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:25] RESOLVED: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 5.3% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:01:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:46:25] FIRING: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:14] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151 (Papaul) NEW
[04:18:37] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159412 (Papaul)
[04:23:29] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159413 (Papaul) @Jhancock.wm if you have some time this week or next week can you please check in rack C8 all the servers that have only...
[04:23:36] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159414 (Papaul) p:Triage→Medium
[04:24:55] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159415 (Papaul)
[04:49:28] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159423 (Papaul)
[05:02:48] FIRING: PuppetFailure: Puppet has failed on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:29:49] FIRING: DiskSpace: Disk space seaborgium:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:41:15] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159452 (ABran-WMF) all actionable machines are ready to be depooled. I'll start depooling 20/15min before 16:00 UTC
[06:47:39] !log cleanup some old Bacula restores (4G) on seaborgium
[06:47:40] moritzm: Not expecting to hear !log here
[06:49:34] RESOLVED: DiskSpace: Disk space seaborgium:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:12:49] RESOLVED: PuppetFailure: Puppet has failed on seaborgium:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:16:25] FIRING: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:34] If someone has a bit of time, could I get a quick review on https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1067960 (NOOP, just adding types)
[07:25:04] lgtm, left a comment but if you tested go ahead!
[07:25:25] technically it is not a no-op, the mac_address default moves from None to '' :)
[07:26:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:38:16] elukey: yeah true, operationally a noop, but not the code itself. I did double check that point though, so we should be good.
[07:38:18] thx!
[07:44:14] XioNoX: if you're taking care of netbox4 upgrade follow ups, there is the double code to support the old version that should be removed. ;)
[07:45:01] volans: maybe we should keep it in case we need to rollback :)
[07:45:11] don't even try
[07:46:35] volans: I think it's only this https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1050453 but let me know if you can think of other places
[07:47:00] most likely yes
[07:47:15] remember that we didn't add test coverage for the netbox 3 part?
[07:47:23] so I notice it every time I run the unit tests :-P
[07:49:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159580 (ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs
[07:49:34] ohh right, well, that was a good reminder then
[07:49:45] will send a patch today :)
[07:50:55] <3 thx
[08:28:10] netops, Infrastructure-Foundations: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401#10159658 (ayounsi) There was indeed a connectivity "blips", so they're not monitoring issues. That's for Sept 9th, where we can see it was only from eqiad : https://grafana.wikimedia.org/d/m1...
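Note on the 07:25:25 remark about the default moving from None to '': a minimal sketch of why such a change can be operationally a no-op while still changing the code's behavior. The function names and signatures here are hypothetical and are not taken from the netbox-extras patch.

```python
from typing import Optional


def describe_old(mac_address: Optional[str] = None) -> str:
    """Hypothetical pre-change signature: the default is None."""
    return mac_address if mac_address else "no MAC"


def describe_new(mac_address: str = "") -> str:
    """Hypothetical post-change signature: the default is an empty string."""
    return mac_address if mac_address else "no MAC"


def is_unset_strict(mac_address: Optional[str]) -> bool:
    """An identity check like this is where the two defaults diverge."""
    return mac_address is None


# Truthiness checks treat None and '' the same way...
assert describe_old() == describe_new() == "no MAC"
# ...but an explicit `is None` check does not.
assert is_unset_strict(None) is True
assert is_unset_strict("") is False
```

Callers that only test truthiness see no difference, which would match "operationally a noop"; any caller comparing against None would, which is presumably the point that was double checked.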
[08:39:17] waiting for CI but `spicerack/netbox.py 250 0 78 0 100%` - https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1074100
[08:39:39] already reviewed, waiting for CI to press send
[09:19:30] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10159837 (cmooney) Open→Resolved a:cmooney
[10:04:16] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10159959 (MoritzMuehlenhoff)
[10:07:14] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#10159969 (MoritzMuehlenhoff) We have migrated puppet merges to puppetserver1001, so this is not a blocker anymore to the shutdown...
[10:07:34] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#10159971 (MoritzMuehlenhoff)
[10:11:01] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10159988 (fgiunchedi)
[10:22:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10160016 (MoritzMuehlenhoff) ganeti2018 is drained
[10:25:46] netops, Infrastructure-Foundations, SRE: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10160028 (cmooney) Open→Resolved In the end we got away without needing this, thanks to data-persistence. I'll close for now and we can re-open if...
[10:39:46] moritzm, elukey: is this a good time for me to run debdeploy for python-wmflib?
[10:40:14] good for me
[10:43:17] ack, go ahead
[10:44:18] thanks!
[10:55:06] all done, thanks!
[11:46:23] CFSSL-PKI, netops, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179 (ayounsi) NEW p:Triage→High
[11:46:40] hello, is there any cfssl expert who could help troubleshoot https://phabricator.wikimedia.org/T375179 ?
[11:53:17] CFSSL-PKI, netops, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10160355 (ayounsi)
[11:53:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160354 (ayounsi)
[11:55:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160375 (cmooney) Resolved→Open
[11:56:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160348 (ayounsi) a:cmooney→ayounsi Blocked on {T365012} to be able to renew the certs. Other than that, manually tested and works as exp...
[11:56:52] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160350 (cmooney) Open→Resolved This has been enabled following the cloudsw upgrades. (see https://gerrit.wikimedia.org/r/c/operations/p...
[13:35:30] XioNoX: o/ going to check in a bit
[13:35:37] <3
[13:57:31] XioNoX: is it ok if I retry the cookbook?
[13:57:39] elukey: yep!
[14:12:40] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10160970 (ayounsi) It would indeed be great to have redundancy for the `fmsw`, but as that device is not managed, there i...
[14:53:03] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161341 (jcrespo) ms backups on codfw are stopped. As usual, not asking for priority over my workmates, but if you...
[15:08:08] CFSSL-PKI, netops, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10161429 (elukey) This time we have an issue with `sign`, since a certificate is already there. I verified with manual commands and `gencert` works fine. I e...
[15:08:29] XioNoX: I think that /var/preserve/csr.pem is missing on the lsw1-f1 side
[15:08:36] could it be possible?
[15:13:02] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161445 (ssingh) Traffic hosts (cp2041/cp2042) are depooled.
[15:14:24] elukey: let me check
[15:19:05] elukey: looks like it... seems like the permanent directory isn't that permanent..
[15:19:56] elukey: are there any downsides with generating a new CSR each time we generate a new certificate?
[15:20:23] as I can't think of a good place to have them rest between repos
[15:21:12] can't you save it where the private key is?
[15:22:05] volans: no, it saves the public and private keys as config strings
[15:23:18] once we have upgraded all the devices to Junos 22.2 or above, we can leverage their fancier PKI storage on the devices
[15:25:13] ack
[15:27:42] XioNoX: basically using gencert every time? Not that I know of
[15:27:49] yeah
[15:28:36] I re-ran the cookbook on lsw1-f1 after disabling grpc to simulate an initial bootstrap, and it worked fine
[15:28:54] elukey: thanks for looking into it, I was too fast blaming cfssl :)
[15:39:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161570 (ABran-WMF) all data-persistence hosts have been depooled and downtimed
[15:43:54] XioNoX: it is always the network! :P
[16:03:54] netops, Infrastructure-Foundations, SRE: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216 (cmooney) NEW p:Triage→Low
[16:09:24] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161695 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a040f2d9-1940-4aba-bd29-efa9aeec87fb) set...
[16:16:51] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161716 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9d0dd9cc-ca9d-4736-b81c-6f32f4a0772d) set...
[16:22:44] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161819 (cmooney) All hosts have been moved and all now responding to ping again.
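Sketch of the failure mode discussed between 15:08 and 15:28, under stated assumptions: the cookbook is taken to use `gencert` on an initial bootstrap and `sign` with a CSR stored on the device when a certificate is already there. That is how the conversation reads, but it is not confirmed against the cookbook source. `choose_action` and its parameters are hypothetical names; only `gencert`, `sign` and /var/preserve/csr.pem come from the log.

```python
def choose_action(cert_on_device: bool, csr_on_device: bool,
                  always_gencert: bool = False) -> str:
    """Return which CFSSL operation a cookbook run would attempt.

    Hypothetical model of the flow described in the chat, not the
    actual sre.network.tls cookbook logic.
    """
    if always_gencert or not cert_on_device:
        # Initial bootstrap, or the proposed "new CSR every time" approach.
        return "gencert"
    if csr_on_device:
        # Renewal path: re-sign the CSR kept on the device.
        return "sign"
    # Certificate present but the stored CSR is gone. In the log this
    # surfaced as a CFSSL "bad request" error; here it is modelled as a
    # missing file.
    raise FileNotFoundError("/var/preserve/csr.pem missing, cannot sign")


# Disabling grpc to simulate a bootstrap takes the gencert path.
assert choose_action(cert_on_device=False, csr_on_device=False) == "gencert"
# Always generating a new CSR sidesteps the missing file entirely.
assert choose_action(True, False, always_gencert=True) == "gencert"
```

Under this model, using `gencert` on every run removes the dependency on the device keeping the CSR between runs, which is the trade-off asked about at 15:19:56.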
[16:28:52] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161862 (ABran-WMF) d/p instances are repooling
[16:31:15] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161877 (MatthewVernon) ms-nodes all good; thanos-be2004 seems OK (but checking that picked up an unrelated replica...
[16:37:38] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161888 (jcrespo) Resumed ms backups on codfw.
[16:44:45] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10161925 (cmooney) >>! In T373942#10154484, @Dwisehaupt wrote: > Am I correct in assuming the checks from the ne...
[17:29:20] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10162152 (Dwisehaupt) @cmooney Thanks for the follow up, it's all cleared up now. Most of this came from my conf...
[17:55:52] netops, Infrastructure-Foundations, SRE: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10162292 (cmooney) Just a note on this task to say that I was able to perform some throughput tests on the old asw-d-codfw devices (QFX5100) which have t...
[17:56:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10162293 (cmooney) Open→Resolved a:cmooney All done with this. Big thanks for @Jhancock.wm for the amazing work m...
[17:57:10] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#10162302 (cmooney) >>! In T360789#9941103, @Papaul wrote: > All the cabling is done. I am leaving this task open so when we move the console cables from a...
[17:57:26] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10162299 (cmooney) Open→Resolved a:cmooney
[17:58:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162308 (cmooney) @Jhancock.wm thanks for doing this. I have completed my testing now on the old switch (thankfully all went well). So thi...
[18:19:30] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162355 (cmooney) Actually just checking it's still at status "planned" in Netbox. And looking at puppetboard it seems it never got added p...
[21:52:22] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10162880 (Dwisehaupt)