[07:34:24] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10216554 (JMeybohm) There are some hardware refreshes planned which should go Bookworm + containerd right away: - {T376171} - {T376185} - {T376170}
[09:14:51] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10216805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm
[09:50:59] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10216910 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm com...
[09:58:47] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10216942 (elukey) Open→Resolved a:elukey The issue seems solved, I tested docker-report multiple times and I did...
[11:00:32] serviceops, cloud-services-team, MW-on-K8s, wikitech.wikimedia.org: Review/update wikitech-static syncing after wikitech moves to Kubernetes - https://phabricator.wikimedia.org/T374114#10217102 (fnegri) I noticed there's an alert firing, probably related to this work: > MWVERSION WARNING - wikit...
[11:01:34] serviceops, MediaWiki-extensions-PropertySuggester, MW-on-K8s, Wikidata, wmde-wikidata-tech: Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604#10217120 (ItamarWMDE) **Prio Notes:** | Impact Area | Affected | |----------...
[11:01:59] serviceops, MediaWiki-extensions-PropertySuggester, MW-on-K8s, Wikidata, wmde-wikidata-tech: [PropertySuggester] Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604#10217125 (ItamarWMDE)
[11:04:59] serviceops, MediaWiki-extensions-PropertySuggester, MW-on-K8s, Wikidata, wmde-wikidata-tech: [PS] Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604#10217138 (ItamarWMDE)
[12:00:52] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10217294 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm
[12:06:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10217299 (Clement_Goubert) >>! In T362408#10216554, @JMeybohm wrote: > There are some hardware refreshes planned which should go Bookworm + container...
[12:30:56] serviceops, cloud-services-team, MW-on-K8s, wikitech.wikimedia.org: Review/update wikitech-static syncing after wikitech moves to Kubernetes - https://phabricator.wikimedia.org/T374114#10217393 (Reedy) It's because there were MW releases last week and no one has updated wikitech-static yet :)
[12:34:25] serviceops, MW-on-K8s: mw-debug-repl fails with `container mediawiki-pinkunicorn-app is not valid for pod mw-debug.codfw.next-5d785576b4-sq6dv` - https://phabricator.wikimedia.org/T376895 (Zabe) NEW
[12:38:43] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10217434 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm completed: - kubestage100...
[12:49:18] serviceops, MW-on-K8s, Scap: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts - https://phabricator.wikimedia.org/T366778#10217508 (akosiaris) Open→Resolved a:akosiaris I 'll resolve this one. Things overall are OK deployment times wise. In f...
[12:56:09] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10217538 (Urbanecm_WMF) On another note, how do we think about one-off maintenance scripts? `mwscript` allows me to run a script from my home, which I used before to debug issue...
[12:57:16] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10217525 (Urbanecm_WMF) >>! In T341553#10194349, @RLazarus wrote: >>>! In T341553#10192994, @taavi wrote: >> Is it possible to include a text file from disk in the container whe...
[13:06:45] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10217571 (Lucas_Werkmeister_WMDE) >>! 
In T341553#10217538, @Urbanecm_WMF wrote: > On another note, how do we think about one-off maintenance scripts? `mwscript` allows me to run...
[13:26:20] serviceops, Infrastructure-Foundations, SRE: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10217676 (elukey) I put some thoughts on the current situation, and even if there are a lot of unknowns, I realized that garbage collection m...
[13:28:55] serviceops, MW-on-K8s: mw-debug-repl fails with `container mediawiki-pinkunicorn-app is not valid for pod mw-debug.codfw.next-5d785576b4-sq6dv` - https://phabricator.wikimedia.org/T376895#10217684 (Clement_Goubert)
[13:28:57] serviceops, Patch-For-Review: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604#10217685 (Clement_Goubert)
[13:31:13] serviceops, MW-on-K8s: mw-debug-repl fails with `container mediawiki-pinkunicorn-app is not valid for pod mw-debug.codfw.next-5d785576b4-sq6dv` - https://phabricator.wikimedia.org/T376895#10217690 (Clement_Goubert) This is because of the introduction of a `next` release. Fix for the script incoming will...
[13:34:04] serviceops, MW-on-K8s, Patch-For-Review: mw-debug-repl fails with `container mediawiki-pinkunicorn-app is not valid for pod mw-debug.codfw.next-5d785576b4-sq6dv` - https://phabricator.wikimedia.org/T376895#10217728 (Clement_Goubert) Open→In progress
[13:34:59] serviceops, MW-on-K8s, Patch-For-Review: mw-debug-repl fails with `container mediawiki-pinkunicorn-app is not valid for pod mw-debug.codfw.next-5d785576b4-sq6dv` - https://phabricator.wikimedia.org/T376895#10217730 (Clement_Goubert) a:Clement_Goubert
[13:35:03] serviceops, SRE: low rate of mw-memcached errors - https://phabricator.wikimedia.org/T371881#10217731 (jijiki)
[14:16:15] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10217895 (Jhancock.wm) new drive installed. looks like alert has cleared. lmk if you need any further assistance.
[14:27:05] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting, and 2 others: hewiki: Use backing node service instead of RESTBase on pregeneration changeprop rules - https://phabricator.wikimedia.org/T372749#10217933 (cscott)
[14:27:23] serviceops, Content-Transform-Team-WIP, Page Content Service, Code-Health-Objective: hewiki: Route mobile-html to the backing node service instead of RESTBase - https://phabricator.wikimedia.org/T372746#10217934 (cscott)
[14:27:25] serviceops: low rate of mw-memcached errors - https://phabricator.wikimedia.org/T371881#10217935 (jijiki)
[15:02:30] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10218077 (Clement_Goubert) Open→In progress p:Triage→Low a:Clement_Goubert Yay, thank you!
[15:44:09] Did something odd happen to the k8s-staging cluster? 
https://grafana-rw.wikimedia.org/d/000000473/kubernetes-pods?orgId=1&from=now-3h&to=now shows it suddenly going very quiet ~ 2hrs ago and I get "no healthy upstream" when trying to ping my services in it.
[15:55:37] Maybe related to the containerd work jayme ^
[15:59:36] And my deploy failed to deploy or rollback.
[16:04:12] one of the two nodes is cordoned and there's a lot of pending pods
[16:04:42] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10218515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm
[16:04:54] ah ^ lol
[16:04:56] yeah i just don't know if jayme is done upgrading to containerd on that node well there you go he hasn't
[16:05:15] it seems like two nodes for the staging cluster isn't enough anymore
[16:05:41] we've identified that and will add a node via an upcoming refresh
[16:05:51] James_F: ok so sorry but you just have to wait a bit :)
[16:05:55] maybe we can do it a little earlier than that
[16:05:55] Sure!
[16:07:53] serviceops, MediaWiki-Uploading, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, and 3 others: Large file uploads broken via Special:Upload - https://phabricator.wikimedia.org/T374436#10218512 (Krinkle)
[16:31:12] James_F: sorry, my laptop crashed mid meeting series and I did not restart the IRC client until now. I made a mistake during reimage of one of the workers and it ended up out of disk
[16:31:23] Ack, no worries.
[16:31:39] Was just worried if prod was broken somehow and someone needed to know about it.
[16:32:38] nono, all "good"
[16:33:16] and yes, one node is unfortunately no longer enough. So during reimages we're in a 'pending pods' situation until we have some scrap hardware to add to the cluster
[16:34:15] I saw that e.g. the staging changeprop nodes use > 0.5 GiB of RAM each, and e.g. 
shellbox has a bunch of nodes even on staging, which was surprising.
[16:34:31] serviceops, MW-on-K8s, Patch-For-Review, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10218645 (CDanis) Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia....
[16:36:10] yeah, there might be room for optimization. But it's a tedious process (and probably not enough)
[16:36:15] especially mid-term
[16:36:29] Just bill each team for every GiB they use, it works for AWS. ;-)
[16:40:50] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10218655 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed: - kubestage100...
[16:51:30] James_F: be my guest :p
[16:51:49] kubestage1003 is up again, things should stabilize
[16:51:51] * James_F tries.
[16:53:11] Yup, deploy worked.
[16:53:25] cool. Sorry for causing trouble
[16:53:51] serviceops, Patch-For-Review: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10218712 (Scott_French) ` $ curl -v 'https://echostore.svc.codfw.wmnet:8082/healthz' ... snip ... * Server certificate: * subject: CN=kask-production-tls-proxy-certs * start date...
[17:10:14] serviceops, MW-on-K8s, Patch-For-Review, Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10218736 (JMeybohm) >>! In T376795#10218645, @CDanis wrote: > Turns out the object counts are already in Prometheus. Here's a quick plo...
[17:11:52] serviceops, Patch-For-Review: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10218744 (Scott_French) echostore in codfw is back to normal traffic levels and mediawiki (envoy) -> echostore metrics look good (no 5xx errors, latency has returned to normal after...
[18:37:51] serviceops, MW-on-K8s: Support bringing text files into the container for one-off maintenance scripts - https://phabricator.wikimedia.org/T376230#10219125 (Urbanecm_WMF) > To use scripts like attachAccount.php that take a filename on the command line, without modifying the code, passing --userlist php://...
[18:40:16] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10219135 (Urbanecm_WMF) >>! In T341553#10217571, @Lucas_Werkmeister_WMDE wrote: >>>! In T341553#10217538, @Urbanecm_WMF wrote: >> On another note, how do we think about one-off...
[18:44:17] serviceops, MW-on-K8s: Support bringing text files into the container for one-off maintenance scripts - https://phabricator.wikimedia.org/T376230#10219151 (RLazarus) Ha, `attachAccount.php` [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralAuth/+/f3436037e10769d4f8fb9d18dc84b4...
[19:38:36] serviceops, MW-on-K8s, Patch-For-Review: --timeout flag for mwscript-k8s - https://phabricator.wikimedia.org/T376099#10219406 (RLazarus) Open→Resolved This is now supported! ` --timeout TIMEOUT Set a deadline for the job, to interrupt it after a set interval. Examples: 1d, 2h, 30m, 4...
[19:56:15] serviceops: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10219468 (Scott_French) p:High→Low Alright, we've been 100% on mesh-flavored echostore in both DCs for about 2h now, and no issues have been encountered. Since the immediate threat of certificate...
[20:04:28] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10219521 (RLazarus) >>! In T341553#10217525, @Urbanecm_WMF wrote: > [...] > The mediafiles can be very large – I've certainly uploaded files that had dozens of GBs in total. As...
[20:04:35] serviceops, MW-on-K8s: MWScript.php doesn't allow wikiless scripts without the .php suffix - https://phabricator.wikimedia.org/T376616#10219522 (RLazarus)
[22:55:21] serviceops, DC-Ops, ops-codfw: Q#:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965 (RobH) NEW
[22:55:42] serviceops, DC-Ops, ops-codfw: Q#:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10220101 (RobH)
[22:55:58] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10220102 (RobH)
[22:56:57] serviceops: wikikube-worker21[56-70] implementation tracking - https://phabricator.wikimedia.org/T376966 (RobH) NEW
[23:05:04] serviceops, DC-Ops, ops-codfw: Q#:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968 (RobH) NEW
[23:05:08] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10220158 (RobH)
[23:05:55] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10220166 (RobH)
[23:06:42] serviceops: mc-gp200[4-6] implementation tracking - https://phabricator.wikimedia.org/T376969 (RobH) NEW