[09:40:09] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (JMeybohm) Global depool of a/a services from codfw is done.
[10:25:23] SRE-tools, Infrastructure-Foundations: Decide which cookbooks using icinga_hosts.wait_for_optimal() should use skip_acked=True - https://phabricator.wikimedia.org/T330136 (Volans) p:Triage→Medium
[10:25:42] moritzm: FYI I've created the above task ^^^
[11:09:19] ack, going to amend the task in a bit with the two cases I ran into
[11:09:39] thx
[11:10:39] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[11:16:46] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[11:18:17] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[13:24:42] SRE-tools, Infrastructure-Foundations, SRE: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (jcrespo)
[13:28:11] SRE-tools, Infrastructure-Foundations, SRE: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (jcrespo) This looks like it could be avoided with some extra check, maybe? I added @jbond and @Volans as I think they were involved in th...
[13:29:17] SRE-tools, Infrastructure-Foundations, observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (jcrespo)
[13:33:52] SRE-tools, Infrastructure-Foundations, observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (jcrespo) Adding @JMeybohm in case it was just a fluke (reimage taking more time than usual).
[13:36:13] SRE-tools, Infrastructure-Foundations, observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (JMeybohm) /cc @elukey this is one of "yours" :)
[13:37:42] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) In the meantime we have created two cookbook: * sre.k8s.upgrade-cluster.py * sre.k8s.wipe-cluster.py
[13:41:35] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215 hos...
[13:45:09] CAS-SSO, Data-Catalog, Data-Engineering, Infrastructure-Foundations, Shared-Data-Infrastructure: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (JArguello-WMF)
[13:45:24] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (fgiunchedi) Following up for silences, especially the ones paging in production (`ProbeDown`). * ProbeDown: the most e...
[13:47:23] CAS-SSO, Data-Catalog, Data-Engineering, Infrastructure-Foundations, Shared-Data-Infrastructure: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (JArguello-WMF) Switching DataHub to OIDC authentication (T305874) is a big job, so we'll schedule it for the next...
[13:55:28] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Vgutierrez)
[14:37:20] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi) Upgrade went smoothly, less than 15min hard downtime here too.
[14:37:58] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[14:45:20] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[14:50:53] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:51:28] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[15:06:02] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jcrespo) I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.
[15:07:33] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[15:48:23] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (Papaul) Open→Resolved Replaced PEM0 everything looks good now . {F36864090}
[15:51:15] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:27:13] netops, Infrastructure-Foundations, SRE, ops-codfw: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048 (Papaul) Open→Resolved Rebooting the mgmt switch fix the issue
[17:14:24] netops, Infrastructure-Foundations, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, and 2 others: [cloudvirt] Move to jumbo frames - https://phabricator.wikimedia.org/T330075 (aborrero)
[20:04:24] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (colewhite)
[22:14:25] netops, Infrastructure-Foundations, SRE, IPv6, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (Aklapper)
[22:14:39] Puppet, Infrastructure-Foundations, Patch-Needs-Improvement, User-jbond: Refactor puppet-merge - https://phabricator.wikimedia.org/T254249 (Aklapper)