[06:19:39] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10dom_walden) @Clement_Goubert The QTE team has finished its testing of testwiki.
[06:21:20] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10dom_walden) I ran the selenium tests for MediaWiki core and a few extensions against testwiki. The only failures appear to be either because testw...
[06:39:52] Hi folks morning
[06:40:17] I'll deploy later on changeprop-jobqueue to pick up the linger.ms settings
[06:40:34] lemme know if you prefer otherwise
[07:19:13] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[07:22:38] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) I think for wikiwand we only allow requests based on referer, should we add or replace the rule with the user...
[07:34:36] 10serviceops, 10MW-on-K8s: Allow deployers to get a php REPL environment inside the mw-debug pods - https://phabricator.wikimedia.org/T341197 (10Joe) 05Open→03Resolved
[07:34:42] 10serviceops, 10Prod-Kubernetes, 10Toolhub, 10Kubernetes, 10Patch-For-Review: Maintenance environment needed for running one-off commands - https://phabricator.wikimedia.org/T290357 (10Joe)
[07:35:01] done :)
[07:39:04] so far no improvements
[07:39:09] so the next step is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936515
[08:21:31] 10serviceops, 10Data-Persistence, 10Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10KartikMistry)
[08:21:40] 10serviceops, 10CX-cxserver, 10Kubernetes, 10Language-Team (Language-2023-July-September), 10Patch-For-Review: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (10KartikMistry) 05Open→03In progress
[08:49:55] 10serviceops, 10Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10SCherukuwada) Thanks for all the inputs. I'd like to verify my expectations around service teams and infrastructure commitments first before I comment furth...
[08:54:40] FYI, kubetcd1005 will briefly go down for a Ganeti reboot
[08:55:53] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[08:55:58] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10Clement_Goubert) 05Open→03Resolved @dom_walden Great, thank you for the report! I've brought testwiki back to baremetal now.
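For context on the linger.ms deploy mentioned above: linger.ms is a standard Kafka producer setting that delays sends so messages can be batched, which reduces the number of produce requests the brokers have to handle. changeprop-jobqueue is a Node.js service configured through its deployment chart, so the sketch below only illustrates the setting itself using confluent-kafka-python; the broker address and topic name are placeholders, not real kafka-main endpoints.

```python
# Sketch of the producer-side effect of linger.ms, using confluent-kafka-python.
# Broker address and topic are placeholders, not real kafka-main endpoints.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-broker.example.org:9092",  # hypothetical broker
    # Wait up to 20 ms for a batch to fill before sending. Larger values mean
    # fewer, larger produce requests (less broker request-handler work) at the
    # cost of slightly higher produce latency.
    "linger.ms": 20,
    "batch.num.messages": 1000,
})

for i in range(100):
    producer.produce("example.jobqueue.topic", value=f"job-{i}".encode())

producer.flush()
```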
[09:13:15] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (10JMeybohm)
[09:13:52] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (10JMeybohm)
[09:18:28] claime: shall we go with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936515 ?
[09:19:13] elukey: I'm not sure I know enough about kafka to be sure about impact, but from what you've linked on the task + the explanation in the commit message I think it's worth a try
[09:19:23] _joe_: opinion? ^
[09:20:20] elukey: do you need the prometheus config patch deployed first to see impact?
[09:22:58] claime: it wouldn't hurt yes, I can wait for a review and then deploy that first
[09:23:20] not 100% needed but more metrics are good (we have other ones that will show impact, hopefully)
[09:25:01] elukey: ack, I don't see why we can't move forward with your change then
[09:26:12] thank yooou for the review, I am going to wait for the prometheus change to be deployed and then I'll do it
[09:27:14] <_joe_> claime: yeah it is worth a try
[09:38:16] prometheus change rolled out now
[09:40:13] https://grafana-rw.wikimedia.org/d/000000027/kafka?forceLogin&orgId=1&from=now-3h&to=now&viewPanel=75
[09:40:26] will give it some time to collect metrics
[09:41:08] ok wow I didn't see these values before, from a quick check via jconsole they seemed higher
[09:41:34] on 3/5 of the kafka main nodes the request handler threads are very busy
[09:41:41] that explains the latencies I think
[09:42:13] 10serviceops, 10Maps: English Wikipedia maps have error 400 when retrieving the static map image for map with Commons data - https://phabricator.wikimedia.org/T341226 (10TheDJ) Example of edits needed to make these graphs work: https://commons.wikimedia.org/w/index.php?title=Data%3A2014_Northwest_Territories_f...
[09:47:41] elukey: At first I was like "that doesn't seem high" then I realized it was idle percentage, not busy
[09:47:55] And err yeah, they look very not idle x)
[09:57:26] <_joe_> if kafka-main is underprovisioned, we have to cut things out
[09:57:48] <_joe_> one big relief will be when we turn off parsoid pregeneration on restbase
[10:03:55] talking about parsoid, there's something strange-ish going on
[10:04:29] We got a spike in requests last night, which has completely disappeared, but latency increased *after*, and is still high
[10:04:40] I imagine it may be reparses due to a template change?
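The busy request handler threads spotted above show up through the broker's RequestHandlerAvgIdlePercent gauge, which is what the Grafana panel linked at 09:40:13 charts. Below is a rough sketch of pulling the same signal straight from the Prometheus HTTP API; the Prometheus URL is a placeholder and the exact metric name depends on the jmx_exporter rules in use, so treat both as assumptions and copy the real expression from the dashboard.

```python
# Sketch: read the request-handler idle gauge from the Prometheus HTTP API.
# Both the Prometheus URL and the metric name are assumptions; the authoritative
# expression is the one used by the Grafana panel linked above.
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
QUERY = (
    "avg by (instance) "
    "(kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent)"  # assumed name
)

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    idle = float(result["value"][1])
    # Idle fractions well below ~0.2 suggest the request handler pool is saturated.
    print(f"{instance}: {idle:.2%} idle")
```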
[10:11:20] <_joe_> claime: uhm let me take a look
[10:11:47] There's a server missing (parse1012), it was taken out because of some CPU weirdness, I'll put it back since it hasn't alerted since
[10:11:54] But it doesn't seem to be congestion imo
[10:12:40] <_joe_> claime: we should probably take a look at the flamegraphs
[10:32:27] sorry back
[10:35:06] it makes sense now why moving partitions around makes things better
[10:35:50] kafka100[45], that have the lowest number of partitions, show ~25% of idle time, that is good
[10:36:28] if you check jumbo, that receives a ton more traffic, we go up to ~80%
[10:36:53] I think that we haven't paid attention to these metrics in the past, we should also probably have some alarms
[10:40:39] all right deploying eventgate
[10:40:41] let's see
[10:42:35] 10serviceops, 10RESTbase Sunsetting, 10Epic, 10Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (10DAlangi_WMF)
[10:44:58] jayme: o/ I see some diffs for envoy-related configs (max requests per conn etc..) in the eventgate-main's diff
[10:45:01] ok to deploy?
[10:48:34] the parsoid latency spike corresponds with an increase in parsoidcacheprewarm jobs which hasn't really evened out (spiked at 7am, gradually decreasing over time)
[10:49:03] jayme: ah I see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/935754, proceeding :)
[10:54:31] <_joe_> hnowlan: I don't think that's the issue tbh
[10:57:05] claime: codfw done, I am going to have lunch before doing eqiad, so we can see if any metric/log goes awol in the timeframe
[11:02:46] also lots of transclusion updates https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&viewPanel=27
[11:08:01] elukey: yeah, sorry. Please go ahead
[11:41:49] elukey: thanks <3
[12:18:02] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) 05In progress→03Resolved
[12:18:14] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:19:17] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert)
[12:19:34] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:19:46] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High
[13:21:15] claime: deploying to eqiad :)
[13:21:35] elukey: ack, be wary we have an outbound port saturation page rn
[13:29:53] produce rate jump
[13:30:57] may just be the restart
[13:31:15] it is yes, the same happened to codfw, should go back to normal in a bit
[13:31:43] Yeah it's just the restart, it's already gone down
[13:31:59] I always get caught out by this with eventgate
[13:36:18] it doesn't seem to have had a measurable effect on the main brokers, but the problem is definitely in the graph that we were checking this morning
[13:37:11] RequestHandlerAvgIdlePercent ?
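One way to check the partition and leader spread behind those per-broker idle-time differences (kafka100[45] versus the rest) is to read the cluster metadata directly. A minimal sketch with confluent-kafka-python follows; the bootstrap server is a placeholder, not a real kafka-main broker.

```python
# Sketch: count partitions and partition leaders per broker from cluster metadata.
# The bootstrap server is a placeholder, not a real kafka-main broker.
from collections import Counter
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-broker.example.org:9092"})
metadata = admin.list_topics(timeout=10)

leaders = Counter()
replicas = Counter()
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        leaders[partition.leader] += 1          # broker id currently leading
        for broker_id in partition.replicas:    # every broker holding a replica
            replicas[broker_id] += 1

for broker_id, broker in sorted(metadata.brokers.items()):
    print(f"broker {broker_id} ({broker.host}): "
          f"{leaders[broker_id]} leaders, {replicas[broker_id]} replicas")
```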
[13:37:26] exactly yes, there is a big imbalance
[13:37:49] and 5% is not good, all docs that I've read suggest 20/30%
[13:38:27] so we may need to rebalance partitions again
[13:38:41] to add more load to 1004/1005
[13:39:49] I suppose we have no way to actually know which topics have imbalanced partitions?
[13:42:31] the big ones should all have one partition for each broker, but the following metric would probably need to be improved
[13:42:34] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=20
[13:42:40] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=48
[13:43:06] there is a ~5:1 ratio of partitions between 1001->1003 and 1004/1005
[13:43:41] every producer can send data only to the partition leader
[13:44:11] and in this case we have a ~4:1 ratio of partition leaders running on 1001/3 and 1004/5
[13:44:35] Yeah that'd explain why 1004/5 are idling while the rest is basically doing nothing
[14:10:39] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10TheresNoTime)
[16:07:03] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10bd808) Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to over come T292707" step w...
[16:14:32] 10serviceops, 10Prod-Kubernetes, 10Toolhub, 10Kubernetes, 10Patch-For-Review: Maintenance environment needed for running one-off commands - https://phabricator.wikimedia.org/T290357 (10bd808) There is now support for a REPL for MediaWiki on Kubernetes from {T341197}. This might be seen as similar to @ako...
[16:39:55] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) 05Resolved→03Open
[16:42:27] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) From comms with wikiwand: It seems User-Agent and Api-User-Agent (for client-side requests) are ignored, ca...
[16:52:04] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) `Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)` added to the list of user-agents. Please advise...
[17:21:15] 10serviceops, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10SRE, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF)
[17:48:27] 10serviceops, 10Thumbor, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability): Upgrade Thumbor to bullseye - https://phabricator.wikimedia.org/T336881 (10Ladsgroup) I understand, my point is around priorities of what to upgrade first: i.e. the order of OS upgrade roll out in th...
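Rebalancing to push more load onto 1004/1005, as discussed above, would normally go through Kafka's partition reassignment tooling, which consumes a JSON plan. The sketch below only builds such a plan for one hypothetical topic: the topic name, the partition count, the broker ids 1001-1005 and the replication factor 3 are all assumptions about the real layout, and a preferred-leader election would still be needed afterwards for leadership to follow the new replica order.

```python
# Sketch: build a kafka-reassign-partitions style JSON plan that rotates the
# preferred leader across all five brokers. Topic name, partition count and
# broker ids are hypothetical placeholders for the real kafka-main layout.
import json

brokers = [1001, 1002, 1003, 1004, 1005]
topic = "example.topic"      # placeholder topic
num_partitions = 5           # assumed: one partition per broker
replication_factor = 3

plan = {"version": 1, "partitions": []}
for partition in range(num_partitions):
    # The first replica listed becomes the preferred leader, so offsetting the
    # replica list by the partition number spreads leadership evenly.
    replicas = [
        brokers[(partition + i) % len(brokers)] for i in range(replication_factor)
    ]
    plan["partitions"].append(
        {"topic": topic, "partition": partition, "replicas": replicas}
    )

with open("reassignment.json", "w") as fh:
    json.dump(plan, fh, indent=2)

print(json.dumps(plan, indent=2))
```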
[18:39:03] Hey, I'm decommissioning dbproxy10[12-17] and they are mentioned in two helm charts: https://gerrit.wikimedia.org/g/operations/deployment-charts/+/225c3b9caad5ed7844ce1cf0c281c3cc0fda0e75/helmfile.d/services/ipoid/values-eqiad.yaml and https://gerrit.wikimedia.org/g/operations/deployment-charts/+/225c3b9caad5ed7844ce1cf0c281c3cc0fda0e75/helmfile.d/services/linkrecommendation/values.yaml
[18:39:08] we probably need to update them
[18:39:23] See T341121
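To track down every values file that still references the hosts being decommissioned, a simple scan of a local deployment-charts checkout is enough; in the sketch below the checkout path is a placeholder and the host pattern just matches dbproxy1012 through dbproxy1017 as listed above.

```python
# Sketch: flag helmfile.d values files that still mention dbproxy1012-1017.
# The checkout path is a placeholder; run it from a local deployment-charts clone.
import re
from pathlib import Path

repo = Path("deployment-charts")          # hypothetical local checkout
pattern = re.compile(r"dbproxy101[2-7]")  # hosts being decommissioned

for values_file in sorted(repo.glob("helmfile.d/**/*.yaml")):
    for lineno, line in enumerate(values_file.read_text(errors="ignore").splitlines(), 1):
        if pattern.search(line):
            print(f"{values_file}:{lineno}: {line.strip()}")
```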