[00:33:20] serviceops, SRE, noc.wikimedia.org: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Krinkle) Open→Resolved a:Joe Presumed fixed by {T341859}. In particular: * [operations/puppet] servi...
[00:57:58] serviceops, noc.wikimedia.org: Evaluate alternative for noc.wikimedia.org/dbconfig/ file server - https://phabricator.wikimedia.org/T343398 (Krinkle)
[00:58:13] serviceops, noc.wikimedia.org: Evaluate alternative for noc.wikimedia.org/dbconfig/ file server - https://phabricator.wikimedia.org/T343398 (Krinkle)
[05:28:01] serviceops, noc.wikimedia.org: Evaluate alternative for noc.wikimedia.org/dbconfig/ file server - https://phabricator.wikimedia.org/T343398 (Joe) FWIW dbtools should use conftool as a python library and fetch the dbconfig that way. I'll shame @Ladsgroup until modifying it :) Tbh I think noc's URL struct...
[06:22:31] Hi folks!
[06:22:40] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/945487/ to update eventgate-main's docker image
[06:22:48] since I'd need to bump a version of a schema
[06:23:07] lemme know if you have anything against me deploying it
[06:46:57] <_joe_> elukey: deploying services for reconfiguration is self-service :)
[06:47:11] <_joe_> including rolling back if something's broken :)
[06:50:11] _joe_ oh yes but since it is jobqueues-related I always ask :)
[06:53:00] <_joe_> elukey: you're too nice as usual
[07:06:42] serviceops, SRE, Wikimedia-Apache-configuration: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (Ifrahkhanyaree)
[07:35:29] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Xover) And the current status is that it takes 5–10 reloads, with a timeout in the...
[07:56:41] (deployed only to codfw canary, all good, will complete the rollout later on)
[08:30:16] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Not specific to this specific ticket but while checking the difftests between restbase and rest...
[08:56:43] (deployed :)
[09:01:17] elukey: <3
[09:03:17] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Xover) 429 response codes in codfw seems to have started jumping some time during J...
[09:14:27] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) From difftests: * Random endpoint redirects as expected * Announcements is return the same res...
[09:15:22] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) There were some missing headers but it should be fixed after this deployment chart bump: https:...
[09:45:44] serviceops, noc.wikimedia.org: Evaluate alternative for noc.wikimedia.org/dbconfig/ file server - https://phabricator.wikimedia.org/T343398 (Ladsgroup) >>! In T343398#9065294, @Joe wrote: > FWIW dbtools should use conftool as a python library and fetch the dbconfig that way. I'll shame @Ladsgroup until h...
[10:06:10] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (hnowlan) >>! In T337649#9065655, @Xover wrote: > 429 response...
[10:22:14] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) I rerun the tests once again after the patch and it looks like we are good to go: * Failures f...
[10:43:53] serviceops, MW-on-K8s, SRE: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Clement_Goubert) After a quick check, it appears we are setting ` $MaxMessageSize 64k ` on mw bare metal hosts and not on kubernetes. Patch incoming.
[10:44:15] serviceops, MW-on-K8s, SRE: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Clement_Goubert) Open→In progress p:Triage→High a:Clement_Goubert
[10:44:28] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[10:49:20] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Lets hold on this because it looks like wikifeeds doesn't handle onthisday URL routing right. I...
[11:33:37] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Xover) >>! In T337649#9065813, @hnowlan wrote: > I'm going to reduce concurrency th...
[11:44:38] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Just tested the fix on staging and looks OK.
[11:51:11] serviceops, MW-on-K8s, SRE, Patch-For-Review: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Clement_Goubert) Deployed, I'll check logstash periodically to see if it was enough to fix the majority of cases.
[12:06:08] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Something that we've not captured is `/feed/availability` which is only served under wikimedia....
[12:08:51] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos)
[12:23:19] serviceops, MW-on-K8s, SRE, Patch-For-Review: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Clement_Goubert) The configuration option hasn't been taken up by the rsyslog containers in the pod, because it's a configmap ch...
[12:52:48] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (Clement_Goubert) a:Clement_Goubert
[12:54:41] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (Clement_Goubert) I'll prepare a patch to...
[13:04:00] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (Jdforrester-WMF) p:Triage→High
[13:13:24] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (Clement_Goubert)
[13:19:12] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 3 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (Clement_Goubert) It's a missing rewrite rule in mediawiki::sites
[13:55:30] serviceops, Abstract Wikipedia team, function-evaluator, function-orchestrator: Split the wikifunctions k8s pod up in production so we have differently-scalable pods for the orchestrator vs. the evaluator - https://phabricator.wikimedia.org/T343459 (Jdforrester-WMF)
[14:54:37] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[14:55:17] serviceops, MW-on-K8s, SRE: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Clement_Goubert) In progress→Resolved The patch has been live for a few hours, and jsontruncated messages from mw-on-k8s are now on the same b...
[15:05:33] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1025.eqiad.wmnet with OS bullseye
[15:20:56] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (Clement_Goubert) ` cgoubert@deploy2002:~$ curl -s --insecure -v -H "Host: www.wikifunctions.org" https://mwdebug.discovery.wmnet:4444...
[15:27:54] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (Clement_Goubert) Open→Resolved ` ❯ for i in {1..100}; do curl -s -v https://www.wikifunctions.org/view/en/Z10000 -o /dev/null...
[15:30:45] serviceops, Abstract Wikipedia team, MW-on-K8s, SRE, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (Jdforrester-WMF) Thank you!
[16:57:05] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1025.eqiad.wmnet with OS bullseye completed: - kubernetes1025 (**WARN**) - Remo...