[08:07:47] serviceops, MinT, Language-Team (Language-2023-October-December): Provide python3-build-bookworm docker image - https://phabricator.wikimedia.org/T352733 (KartikMistry) p:Triage→Medium a:KartikMistry
[09:40:50] serviceops, MinT, Language-Team (Language-2023-October-December): Provide python3-build-bookworm docker image - https://phabricator.wikimedia.org/T352733 (Pginer-WMF) Open→Resolved
[10:25:33] serviceops, Data-Platform-SRE, SRE, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (Gehel)
[10:46:34] serviceops, CX-cxserver, RESTBase Sunsetting, Language-Team (Language-2024-January-March), Patch-For-Review: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (Pginer-WMF)
[12:21:33] serviceops, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (Gehel)
[14:47:22] serviceops, Prod-Kubernetes, Kubernetes: Limit the concurrency of envoy in service mesh - https://phabricator.wikimedia.org/T354532 (JMeybohm)
[14:50:06] serviceops, Prod-Kubernetes, Kubernetes: Limit the concurrency of envoy in service mesh - https://phabricator.wikimedia.org/T354532 (Joe) It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we tak...
[15:12:52] serviceops, Prod-Kubernetes, Kubernetes: Limit the concurrency of envoy in service mesh - https://phabricator.wikimedia.org/T354532 (JMeybohm) >>! In T354532#9441771, @Joe wrote: > It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because...
[15:23:44] serviceops, WMF-JobQueue, Wikimedia-production-error: Make changeprop-jobqueue error handling/httpbb tests better behaved: Uncaught Error: Class 'MWExceptionHandler' not found in /srv/mediawiki/rpc/RunSingleJob.php:42 - https://phabricator.wikimedia.org/T352265 (hnowlan) >>! In T352265#9420602, @matm...
[15:38:06] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) Open→In progress a:Clement_Goubert→Papaul Host is now drained and cordoned. It is in codfw rack...
[15:48:59] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8ffd1f55-9e4b-4439-910b-8b498e421351) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their service...
[16:23:45] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @Clement_Goubert thanks will work on it in a minute
[16:25:36] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney) @papaul let me know what port is used on lsw1-b8-codfw once done and I will make the Netbox changes and assign new IPs f...
[16:35:58] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye
[16:45:31] serviceops, MW-on-K8s, Patch-For-Review: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (CodeReviewBot) rzl closed https://gitlab.wikimedia.org/repos/sre/k8s-controller-sidecars/-/merge_requests/3 Check for running containers, not ready containers
[16:47:24] scap is failing with disk full https://www.irccloud.com/pastebin/Fju2tGn0/
[16:48:00] /dev/mapper/mw2259--vg-srv 40G 38G 4.0K 100% /srv
[16:49:17] https://www.irccloud.com/pastebin/i8dURgYE/
[16:49:49] https://www.irccloud.com/pastebin/IyI44O47/
[16:49:59] I'm going to drop those
[16:51:28] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw2259&var-datasource=thanos&var-cluster=wmcs&from=1704722262147&to=1704732671168&viewPanel=12
[16:54:35] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye executed with errors: - mw1377 (**FAIL**) - Removed from Puppet...
[16:55:28] Amir1: thanks, I'll check if there are others where php-1.39.0-wmf.19 is still on disk
[16:55:34] was it the only one you removed ?
[16:56:09] no, I removed all of 1.39s and 1.40
[16:56:13] kept 1.42s
[16:56:22] ack
[16:58:26] we should probably check the ones that are high on disk space usage
[16:59:41] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw2259&var-datasource=thanos&var-cluster=wmcs&from=1704729517544&to=1704733132500&viewPanel=12
[17:00:01] Amir1: might as well just cumin a rm of php-1.{39,40}*
[17:00:09] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye
[17:00:18] claime: you're giving me evil ideas
[17:00:28] do tell
[17:00:57] well, now almost every deploy failed
[17:01:26] same problem?
[17:01:29] claime: what is the alias for all mw appservers?
[17:01:30] yeah
[17:01:35] A:all-mw
[17:01:39] and I think it might cause a major issue right now
[17:01:41] thanks
[17:02:32] tell me if you want me to do anything
[17:34:18] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye completed: - mw1377 (**WARN**) - Downtimed on Icinga/Alertmanag...
[17:38:05] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) Note a few other things that have been tried: - Updated firmware (UEFI + iDRAC) on mw1378, did not help. - Appears to be related to the `wdat_wdt` watchdog driver (all affected hosts have...
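Aside on the disk-full cleanup discussed at 16:56-17:00 (remove the 1.39 and 1.40 trees, keep 1.42): a minimal sketch of how the stale directories could be picked out before anything is deleted. This is not what was run in the channel. The function name, the pattern tuple, and every directory name except php-1.39.0-wmf.19 are hypothetical, and the sketch only selects names; it deletes nothing.

```python
import fnmatch

# Hypothetical patterns mirroring the suggestion in the chat:
# php-1.{39,40}* expands to these two globs.
STALE_PATTERNS = ("php-1.39*", "php-1.40*")


def stale_version_dirs(names, patterns=STALE_PATTERNS):
    """Return the names matching any stale pattern, sorted.

    Nothing is removed here; the result is only a candidate list
    that an operator would review before cleaning up.
    """
    return sorted(
        name
        for name in names
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    )


if __name__ == "__main__":
    # Only php-1.39.0-wmf.19 appears in the log; the rest are made up.
    on_disk = [
        "php-1.39.0-wmf.19",
        "php-1.40.0-wmf.1",
        "php-1.42.0-wmf.12",
        "php-1.42.0-wmf.13",
    ]
    for name in stale_version_dirs(on_disk):
        print(name)
```

Listing candidates first and deleting in a separate, reviewed step keeps a wrong glob from removing a version that is still being served.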
[17:41:54] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila) p:Triage→High
[17:42:29] serviceops, MW-on-K8s: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (kamila)
[17:58:02] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @cmooney xe-0/0/26
[17:58:54] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul)
[18:01:04] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Papaul) mainboard replaced by @Jhancock.wm. She is running the provision cookbook now.
[19:43:43] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (Arlolra) https://parsoid-rt-tests.wikimedia.org/ now looks correct. I guess the last step here is to decommission 1001