[02:12:40] serviceops, MediaWiki-Engineering: Fold services recommendations into Standards for services RfC - https://phabricator.wikimedia.org/T239856 (Krinkle) a:Krinkle→Joe //Reflecting out-of-bound state on Phab.//
[02:22:19] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Samwilson)
[09:34:01] serviceops, Dumps-Generation, Infrastructure-Foundations, SRE-tools, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are su...
[10:54:26] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (kamila)
[10:54:45] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) In progress→Resolved KSM is deployed in other clusters and appears to work, so I'm closing this :-)
[12:06:58] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw2423.codfw.wmnet with OS bullseye
[12:07:25] serviceops, MW-on-K8s, MediaWiki-Platform-Team, Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Clement_Goubert) I think we can also set it in the php-fpm pool conf like ` env[MCROUTER_SERVER] = $MCROUTER_SERVER ` and set `$MCROUTER_SERVER` through k...
[12:10:49] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw2424.codfw.wmnet with OS bullseye
[12:18:08] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw2434.codfw.wmnet with OS bullseye
[12:25:30] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw2435.codfw.wmnet with OS bullseye
[12:28:42] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw1463.eqiad.wmnet with OS bullseye
[12:46:02] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw2423.codfw.wmnet with OS bullseye completed: - mw2423 (**PASS**) - Downtimed on I...
[12:50:39] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw2424.codfw.wmnet with OS bullseye completed: - mw2424 (**PASS**) - Downtimed on I...
[12:58:51] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw2434.codfw.wmnet with OS bullseye completed: - mw2434 (**PASS**) - Downtimed on I...
[13:02:44] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw1463.eqiad.wmnet with OS bullseye completed: - mw1463 (**PASS**) - Downtimed on I...
[13:06:13] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw1464.eqiad.wmnet with OS bullseye
[13:06:40] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw2435.codfw.wmnet with OS bullseye completed: - mw2435 (**PASS**) - Downtimed on I...
[13:08:59] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw1465.eqiad.wmnet with OS bullseye
[13:14:17] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw1470.eqiad.wmnet with OS bullseye
[13:19:11] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (KTT-Commons) >>! In T337649#9378880, @hnowlan wrote: > We rec...
[13:39:03] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw1464.eqiad.wmnet with OS bullseye completed: - mw1464 (**PASS**) - Downtimed on I...
[13:43:48] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw1465.eqiad.wmnet with OS bullseye completed: - mw1465 (**PASS**) - Downtimed on I...
[13:48:08] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw1470.eqiad.wmnet with OS bullseye completed: - mw1470 (**PASS**) - Downtimed on I...
[13:56:07] hello folks
[13:56:16] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/980391/ to upgrade rec-api to nodejs 19
[13:56:19] err 18 sorry
[13:56:31] so if you are ok, I'll deploy it to staging, do some basic tests and then prod
[14:04:08] elukey: <3 ack
[14:15:25] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[14:15:35] serviceops, API Platform (RESTbase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (Jdforrester-WMF)
[14:15:45] serviceops, SRE, API Platform (RESTbase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[14:15:56] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[14:17:40] serviceops, Machine-Learning-Team: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (elukey) a:elukey→None
[14:21:22] serviceops, Machine-Learning-Team: Bump istio and Cert Manager Docker images to Bullseye - https://phabricator.wikimedia.org/T351933 (elukey) Cert Manager deployed in staging envs, the plan is to leave it running for 2/3 days to see new certs issued. Once done, we can rollout to prod and close.
[14:34:59] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm)
[14:41:38] not sure if I mentioned, but with node 18 calls to localhost:6500 end up by default using ipv6, and [::]:6500 results in connection refused
[14:41:46] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/980407
[14:42:00] IIRC we had to do something similar for eventgate too
[14:42:28] (nothing to worry about but I thought to mention it for awareness)
[14:57:33] deployed the new rec-api, I tested it on staging and it worked fine
[14:57:49] don't know all the endpoints/URLs, in case somebody complains it is my fault
[14:58:09] cc: James_F: --^
[14:58:11] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1001 for host mw1471.eqiad.wmnet with OS bullseye
[14:58:23] Ack.
[15:11:20] James_F: I see some errors for metrics publishing that were warnings before, and now the only big downside is https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api (no metrics after the deploy)
[15:12:03] Ah, is this the statsd -> prometheus migration?
[15:12:27] * James_F tries to remember what magic incantation is needed to re-fix the metrics.
[15:13:52] https://logstash.wikimedia.org/goto/b87e7c0afd05cac71f1fab218e53322b also has a few WORKER TIMEOUT errors, though no more than before.
[15:19:19] I see stuff like
[15:19:20] {"name":"recommendation-api","hostname":"recommendation-api-production-5459988bb6-g4n7q","pid":17,"level":"ERROR","levelPath":"error/metrics","msg":"endTiming() unsupported for metric type Gauge","time":"2023-12-05T15:18:14.563Z","v":0}
[15:20:23] the same was a warning (to be deprecated) before
[15:20:28] Right.
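The ERROR logged at 15:19:20 says that `endTiming()` was called on a gauge-type metric. A gauge holds a single current value, whereas a duration is usually recorded as one observation in a histogram-style metric. The sketch below is a toy model of that distinction only: `ToyGauge` and `ToyHistogram` are hypothetical classes written for illustration, not the metrics API that rec-api's util.js calls, and the bucket limits are assumed values. The log does not confirm what the actual fix should be.

```javascript
// Hypothetical stand-ins for illustration; NOT a real metrics library API.

class ToyGauge {
  constructor() {
    this.value = 0;
  }

  // A gauge only remembers the most recent value it was given.
  set(value) {
    this.value = value;
  }

  // Mirrors the logged error: timing is rejected for this metric type.
  endTiming() {
    throw new Error('endTiming() unsupported for metric type Gauge');
  }
}

class ToyHistogram {
  // upperBounds: ascending bucket limits in milliseconds (assumed values).
  constructor(upperBounds) {
    this.upperBounds = upperBounds;
    // One counter per bucket, plus a final overflow bucket.
    this.counts = new Array(upperBounds.length + 1).fill(0);
    this.sum = 0;
  }

  // Record one observation in the first bucket whose limit is >= value.
  observe(value) {
    let index = this.upperBounds.findIndex((limit) => value <= limit);
    if (index === -1) {
      index = this.upperBounds.length; // overflow bucket
    }
    this.counts[index] += 1;
    this.sum += value;
  }

  // Record the time elapsed since startMs; nowMs is injectable for tests.
  endTiming(startMs, nowMs = Date.now()) {
    const elapsed = nowMs - startMs;
    this.observe(elapsed);
    return elapsed;
  }
}
```

In this toy model a request timer would call `endTiming(start)` on a `ToyHistogram`, while the same call on a `ToyGauge` throws with the message seen in the log.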
[15:21:43] there is some code in util.js referencing it
[15:21:52] wondering what we need to use instead
[15:27:52] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) attempted to provision sessionstore2004 on the new lsw switch. needs further attention.
[15:31:40] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1001 for host mw1471.eqiad.wmnet with OS bullseye completed: - mw1471 (**PASS**) - Downtimed on I...
[15:31:59] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm)
[15:39:48] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (hnowlan) We've removed the expensive file format throttling e...
[16:02:58] James_F: for the moment we could just rollback to the prev version, and then possibly have a fix during the next days? If you have time to help I'd be glad, my nodejs knowledge is limited :D
[16:05:02] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/979703
[16:10:56] rolled back, metrics are displayed again
[16:11:05] first test didn't go well :D
[16:13:55] elukey: Yeah, I believe I saw a commit from someone fixing a related issue, but I've been trying to find it without success.
[16:27:03] ack lemme know!
[16:27:20] I imagine an SREer might know better though.
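On the 14:41:38 note about node 18 and localhost:6500: since Node.js 17, `dns.lookup()` no longer reorders results to put IPv4 addresses first, so `localhost` can resolve to `::1`, and a listener bound only to IPv4 then refuses the connection. The sketch below shows two generic client-side workarounds. It is not the content of change 980407, and `pinLoopbackToIPv4` and the example URL are made up for illustration.

```javascript
const dns = require('node:dns');

// Workaround 1: ask Node.js to prefer IPv4 results process-wide,
// which matches the default behaviour before Node.js 17.
dns.setDefaultResultOrder('ipv4first');

// Workaround 2 (hypothetical helper): avoid name resolution entirely by
// rewriting a "localhost" URL to the IPv4 loopback address.
function pinLoopbackToIPv4(rawUrl) {
  const url = new URL(rawUrl);
  if (url.hostname === 'localhost') {
    url.hostname = '127.0.0.1';
  }
  return url.toString();
}
```

For example, `pinLoopbackToIPv4('http://localhost:6500/w/api.php')` returns `http://127.0.0.1:6500/w/api.php`, while URLs for any other host are returned unchanged.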
[17:07:08] serviceops, Dumps-Generation, MW-on-K8s, Release-Engineering-Team: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650 (Milimetric) This sounds like it would work... but I do want to point out a potential maintenance issue: The three...