[06:26:16] Traffic, Infrastructure-Foundations, SRE-tools: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (ayounsi)
[07:00:37] Traffic, Data-Engineering, Data-Persistence, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:00:57] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:08:24] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:10:15] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) @jcrespo could you double check the backup-related hosts? Thanks!
[07:10:32] Traffic, netops, Data-Engineering, Data-Persistence, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) p:Triage→Medium
[07:12:38] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:13:06] Traffic, netops, Data-Engineering, Data-Persistence, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[07:13:24] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) m3-master and m5-master have been failed over.
[07:13:56] Traffic, netops, DBA, Data-Engineering, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui)
[07:21:21] Traffic, netops, DBA, Data-Engineering, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui) @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ?
[07:30:49] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (akosiaris) Yes, we 'll have to depool codfw.
[07:31:07] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) >>! In T334049#8757732, @Marostegui wrote: > @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ? That's my understandi...
[07:31:35] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:37:59] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[07:43:59] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (ayounsi) Open→Resolved a:ayounsi This has been rolled to all k8s clusters.
[07:45:44] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (jcrespo)
[07:46:35] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (jcrespo) >>! In T333377#8757686, @Marostegui wrote: > @jcrespo could you double check the backup-related hosts? Thanks! Documented- minor to no disr...
[07:47:42] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (jcrespo)
[08:00:07] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) Open→Stalled Marking it as stalled until the cookbook is reviewed/merged.
[08:01:12] netops, Infrastructure-Foundations, SRE: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (ayounsi) a:cmooney
[08:02:07] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui)
[08:03:39] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui) @jcrespo kindly check backup servers needs. Thanks
[08:19:15] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (jcrespo)
[08:32:02] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (ayounsi) https://www.juniper.net/documentation/us/en/software/junos/system-mgmt-monitoring/topics/ref/statement/enhanced-hash-key-e...
[09:07:22] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[09:15:12] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[09:15:33] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[09:22:45] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[09:33:57] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (MoritzMuehlenhoff)
[09:34:23] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (MoritzMuehlenhoff)
[09:46:57] (PurgedHighEventLag) firing: (3) High event process lag with purged on cp1082:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:51:57] (PurgedHighEventLag) firing: (5) High event process lag with purged on cp1082:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:10:38] mmmm
[10:11:39] this is a little weird, it seems that purged doesn't like the last kafka-main restarts
[10:11:57] trying to restart purged on a cp node to see if it is only temporary
[10:12:59] yeah very weird, now I don't see issues anymore
[10:15:35] all right restarted on the remaining 4 nodes
[10:16:57] (PurgedHighEventLag) firing: (10) High event process lag with purged on cp1082:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:17:12] (PurgedHighEventLag) resolved: (10) High event process lag with purged on cp1082:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:18:43] so afaics it seems that purged doesn't like very well group rebalances
[10:23:10] Traffic: purged issues whule kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (elukey)
[10:23:15] created --^
[10:25:23] Traffic: purged issues whule kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (elukey) I see stable tcp connections to kafka-main1002 and 1003 after the restart on cp5032, so I am inclined to say that it is not related to the PKI cert rollout: ` elukey@cp5032:~$ sudo netstat -tunap |...
[10:29:10] vgutierrez: let's sync later on on --^ if you have time
[10:29:59] ah no of course you are afk today, sorry for the ping!
[10:36:42] No problem
[11:32:44] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (hnowlan)
[11:33:44] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (hnowlan)
[12:07:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ayounsi) a:ayounsi
[12:10:00] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (ayounsi) a:ayounsi Taking that task, even if the current CR does the job, it could be refactored with @cmooney work to remove the duplicated co...
[12:11:30] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[12:28:02] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (ayounsi) Open→Resolved a:ayounsi This is completed in drmrs, the same will be applied to the other sites when we bring L3 on the ToR switches as I don't think...
[13:04:51] Traffic, SRE: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Aklapper)
[13:58:32] Traffic, SRE: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Ottomata) From a brief glance, those look like normal consumer reassignment messages. Probably shouldn't be alerts.
[14:04:25] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs) As an update, this is now blocked on {T297596}. The previous implementation discussion led to a finalization of guidelines, w...
[14:04:44] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs)
[15:06:02] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)
[16:20:55] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[16:47:50] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye executed with errors: - lvs4008 (**FAIL**) - Downtimed on Icinga/Alertmanager...
[16:48:02] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye
[17:22:37] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed: - lvs4008 (**WARN**) - Downtimed on Icinga/Alertmanager - //Unable...
[17:51:06] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye
[17:51:22] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[18:37:39] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye executed with errors: - lvs4009 (**FAIL**) - Downtimed on Icinga/Alertmanager...
[18:38:03] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye
[19:13:03] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs4009.ulsfo.wmnet with OS bullseye completed: - lvs4009 (**PASS**) - Removed from Puppet and PuppetDB if present...
[19:24:46] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[19:53:04] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye
[20:44:00] Traffic, Infrastructure-Foundations, SRE: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (CDanis)
[20:44:59] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5004.eqsin.wmnet with OS bullseye completed: - lvs5004 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
[20:46:29] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[20:49:21] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:16:28] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye
[21:26:06] Traffic, Infrastructure-Foundations: Manual upload of iDRAC EXE results in broken web interface - https://phabricator.wikimedia.org/T334146 (BCornwall)
[22:05:52] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5005.eqsin.wmnet with OS bullseye completed: - lvs5005 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
[22:12:17] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)