[04:47:34] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Krinkle)
[04:49:19] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Krinkle)
[08:22:55] netops, Infrastructure-Foundations, serviceops, Kubernetes: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (ayounsi) p:Triage→High
[08:47:36] netops, Infrastructure-Foundations, SRE, serviceops, Kubernetes: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) a:JMeybohm
[09:19:08] netops, Infrastructure-Foundations, SRE, serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) This happened while I was running docker pull tests 2021-07-21 ~15:04Z and kubernetes1005 is one of the dedicated sessionstore nodes runnin...
[11:09:07] netops, Infrastructure-Foundations, SRE, serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) K8s event logs (https://logstash.wikimedia.org/goto/b16700661b703799af5ac188db2d3f5c) are pretty clear on that I created a lot of disk pres...
[13:33:44] netops, Infrastructure-Foundations, SRE, serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) Open→Resolved Ok, really dumb situation! A bunch of (failing) sessionstore Pods are clogging all resources on kubernetes1005, leavi...
[14:23:24] netops, Alerting, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata)
[14:26:50] netops, Alerting, Infrastructure-Foundations, SRE: Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata)
[14:30:17] CAS-SSO, Infrastructure-Foundations, Metrics, SRE, User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (lmata)
[14:54:17] CAS-SSO, Puppet, Infrastructure-Foundations, Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (Kormat) Golang 1.17 will remove support for the work-around: https://golang.org/doc/go1.16#crypto/x509
[16:40:21] SRE-tools, DBA, Infrastructure-Foundations, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (Legoktm) p:Triage→Low >>! In T287129#7229380, @LSobanski wrote: > Certainly makes sens...
[17:14:56] SRE-tools, DBA, Infrastructure-Foundations, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (LSobanski) @Legoktm Thanks! I'd say this is not top of our priority list right now so the cook...