[09:28:45] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9756823 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2b315a7-d925-49a5-80d5-19849b998b72) set by jayme@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Degra...
[10:16:51] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757021 (10JMeybohm) @Jhancock.wm I've tried powercycling the system and restarting iDRAC to see if the storage controller "comes back", but no luck. During boot I did see 2 SATA drives listed, though. Ofc. /...
[10:44:54] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757115 (10JMeybohm) @Jhancock.wm I did shut down the server for now. Could you please try to drain flea power and see if the controller comes back after? If not, please open a case with Dell
[11:01:36] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757195 (10JMeybohm) Host is set pooled=inactive, cordoned in k8s, removed from BGP and shut down, so all yours
[11:57:47] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 12), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9757373 (10SGupta-WMF) @Scott_French Thank you! We are in the process of creating...
[13:27:04] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757729 (10taavi)
[13:55:12] 06serviceops, 06Machine-Learning-Team, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9757868 (10elukey) All changes rebased and ready to go (for prod). The main idea is the following: * Remove WIKI_URL for revscoring isvcs, so we'll...
[13:55:13] claime: o/
[13:55:13] we are almost ready to go with the mw-api-int-ro switch!
[13:55:13] just an FYI about rps - lift wing eqiad does between 150 and 200 rps, codfw ~60 rps
[13:55:21] it is a little more than the 100 that we discussed, lemme know if it is a problem on your side
[13:55:27] (like more pods needed etc..)
[13:56:02] 06serviceops, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9757891 (10MoritzMuehlenhoff)
[14:07:48] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757948 (10Jhancock.wm) Draining didn't fix it. I'm gonna update the firmware and BIOS and then see where it is.
[14:13:55] 06serviceops, 06Machine-Learning-Team, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9757995 (10elukey) a:03elukey
[14:28:30] elukey: c.lem is out for two weeks(?), but I think double the 200rps should be fine
[15:19:24] jayme: ahhh snap didn't check before pinging, thanks
[15:56:34] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Wikikube staging clusters are out of IPv4 Pod IP's - https://phabricator.wikimedia.org/T345823#9758362 (10JMeybohm) I've moved staging-codfw to /28 blocks using the process outlined in the calico docs. Instead of re-scheduling all pods twic...
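(Editor's note on the T345823 block-size change above: Calico IPAM carves the pod IP pool into fixed-size blocks, and each block is typically announced as its own route, so smaller blocks stretch a tight pool further at the cost of more prefix announcements, which is exactly the trade-off discussed in the follow-up below. The sketch here is only back-of-the-envelope arithmetic; the pool size and pod counts are made-up examples, not the actual staging-codfw figures.)

```python
# Illustrative only: rough block-count arithmetic for Calico IPAM block sizes.
# POOL_PREFIX, PODS_PER_NODE and NODES are hypothetical numbers, NOT the real
# staging values from T345823.
import math

POOL_PREFIX = 24      # hypothetical pod IP pool, a /24 => 256 addresses
PODS_PER_NODE = 40    # hypothetical pod count on one staging node
NODES = 4             # hypothetical node count

def blocks_per_node(block_prefix: int, pods: int) -> int:
    """Blocks one node needs when each block holds 2^(32 - block_prefix) IPs."""
    ips_per_block = 2 ** (32 - block_prefix)
    return math.ceil(pods / ips_per_block)

for block_prefix in (26, 28, 30):
    per_node = blocks_per_node(block_prefix, PODS_PER_NODE)
    total_prefixes = per_node * NODES
    blocks_in_pool = 2 ** (32 - POOL_PREFIX) // 2 ** (32 - block_prefix)
    print(f"/{block_prefix} blocks: {per_node} per node, "
          f"~{total_prefixes} announced prefixes, "
          f"{blocks_in_pool} blocks available in the pool")
```

(With these toy numbers, /26 blocks exhaust the pool after 4 nodes, /30 blocks fit easily but mean dozens of announcements, and /28 sits in between, which is consistent with the choice described above.)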
[16:13:14] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Wikikube staging clusters are out of IPv4 Pod IP's - https://phabricator.wikimedia.org/T345823#9758460 (10JMeybohm) a:03JMeybohm For the record: /30 blocks led to too many prefix announcements, so the BGP sessions got blocked by the router...
[16:36:15] 06serviceops, 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9758615 (10Jhancock.wm) iDRAC upgraded to 7.0.0; won't go any higher. BIOS is already at 2.9.3. Reset the factory defaults and tried rebooting the iDRAC. Reseated the backplane. None of these have fixed the...
[19:23:37] looks like staging k8s logs aren't in logstash? https://logstash.wikimedia.org/goto/386d707856fd6f4d94043299c77c67d5 . I've tried several namespaces and they seem to disappear around 1745 UTC? ... I just pinged in observability as well
[19:34:41] Follow-up on ^^, this is a known issue (see T363856)
[20:21:38] 06serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9759478 (10Scott_French) Well, that seems to have gone off without significant issue. Many thanks to @Volans and @RLazarus for all of your help. A couple of observations and / or lessons learned: **De...
[23:21:26] 06serviceops: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636#9759901 (10Scott_French) 05In progress→03Resolved I've updated https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster to reflect the new state of the world, while also fixing up some out-of-date...
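(Editor's note on T358636 above: this is not the etcdmirror code, just a generic python-etcd sketch of the failure mode named in the task title and one possible recovery path, assuming standard etcd v2 watch semantics: the server only retains a bounded event history, so a watch with a stale waitIndex is rejected as "cleared" and can never succeed by simply retrying the same index. Host and key names below are made up.)

```python
# Generic sketch (NOT etcdmirror itself): recovering a v2 watch whose waitIndex
# has fallen out of etcd's bounded event history.
import etcd  # python-etcd, etcd v2 API

client = etcd.Client(host="etcd.example.org", port=2379)  # hypothetical host
KEY = "/example"  # hypothetical key prefix

def follow(start_index=None):
    """Yield watch events for KEY, resyncing when the index has been cleared."""
    index = start_index
    while True:
        try:
            event = client.watch(KEY, index=index, recursive=True)
            yield event
            index = event.modifiedIndex + 1
        except etcd.EtcdEventIndexCleared:
            # The requested index is older than the history etcd still keeps.
            # Re-read the current state (resyncing whatever consumes it) and
            # resume watching from the server's current index instead of the
            # stale one.
            snapshot = client.read(KEY, recursive=True)
            index = snapshot.etcd_index + 1
```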