[06:19:39] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10dom_walden) @Clement_Goubert The QTE team has finished its testing of testwiki.
[06:21:20] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10dom_walden) I ran the selenium tests for MediaWiki core and a few extensions against testwiki. The only failures appear to be either because testw...
[06:39:52] Hi folks morning
[06:40:17] I'll deploy later on changeprop-jobqueue to pick up the linger.ms settings
[06:40:34] lemme know if you prefer otherwise
[07:19:13] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[07:22:38] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) I think for wikiwand we only allow requests based on referer, should we add or replace the rule with the user...
[07:34:36] 10serviceops, 10MW-on-K8s: Allow deployers to get a php REPL environment inside the mw-debug pods - https://phabricator.wikimedia.org/T341197 (10Joe) 05Open→03Resolved
[07:34:42] 10serviceops, 10Prod-Kubernetes, 10Toolhub, 10Kubernetes, 10Patch-For-Review: Maintenance environment needed for running one-off commands - https://phabricator.wikimedia.org/T290357 (10Joe)
[07:35:01] done :)
[07:39:04] so far no improvements
[07:39:09] so the next step is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936515
[08:21:31] 10serviceops, 10Data-Persistence, 10Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10KartikMistry)
[08:21:40] 10serviceops, 10CX-cxserver, 10Kubernetes, 10Language-Team (Language-2023-July-September), 10Patch-For-Review: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (10KartikMistry) 05Open→03In progress
[08:49:55] 10serviceops, 10Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10SCherukuwada) Thanks for all the inputs. I'd like to verify my expectations around service teams and infrastructure commitments first before I comment furth...
[08:54:40] FYI, kubetcd1005 will briefly go down for a Ganeti reboot
[08:55:53] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[08:55:58] 10serviceops, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team (QTE): Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 (10Clement_Goubert) 05Open→03Resolved @dom_walden Great, thank you for the report! I've brought testwiki back to baremetal now.
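For context on the linger.ms deploy mentioned above: linger.ms is a standard Kafka producer setting that delays sends so messages can be batched, which reduces the number of produce requests the brokers have to handle. changeprop-jobqueue is a Node.js service configured through its deployment chart, so the sketch below only illustrates the setting itself using confluent-kafka-python; the broker address and topic name are placeholders, not real kafka-main endpoints.

```python
# Sketch of the producer-side effect of linger.ms, using confluent-kafka-python.
# Broker address and topic are placeholders, not real kafka-main endpoints.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-broker.example.org:9092",  # hypothetical broker
    # Wait up to 20 ms for a batch to fill before sending. Larger values mean
    # fewer, larger produce requests (less broker request-handler work) at the
    # cost of slightly higher produce latency.
    "linger.ms": 20,
    "batch.num.messages": 1000,
})

for i in range(100):
    producer.produce("example.jobqueue.topic", value=f"job-{i}".encode())

producer.flush()
```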
[09:13:15] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (10JMeybohm)
[09:13:52] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (10JMeybohm)
[09:18:28] claime: shall we go with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936515 ?
[09:19:13] elukey: I'm not sure I know enough about kafka to be sure about impact, but from what you've linked on the task + the explanation in the commit message I think it's worth a try
[09:19:23] _joe_: opinion? ^
[09:20:20] elukey: do you need the prometheus config patch deployed first to see impact?
[09:22:58] claime: it wouldn't hurt yes, I can wait for a review and then deploy that first
[09:23:20] not 100% needed but more metrics are good (we have other ones that will show impact, hopefully)
[09:25:01] elukey: ack, I don't see why we can't move forward with your change then
[09:26:12] thank yooou for the review, I am going to wait for the prometheus change to be deployed and then I'll do it
[09:27:14] <_joe_> claime: yeah it is worth a try
[09:38:16] prometheus change rolled out now
[09:40:13] https://grafana-rw.wikimedia.org/d/000000027/kafka?forceLogin&orgId=1&from=now-3h&to=now&viewPanel=75
[09:40:26] will give it some time to collect metrics
[09:41:08] ok wow I didn't see these values before, from a quick check via jconsole they seemed higher
[09:41:34] on 3/5 of the kafka main nodes the request handler threads are very busy
[09:41:41] that explains the latencies I think
[09:42:13] 10serviceops, 10Maps: English Wikipedia maps have error 400 when retrieving the static map image for map with Commons data - https://phabricator.wikimedia.org/T341226 (10TheDJ) Example of edits needed to make these graphs work: https://commons.wikimedia.org/w/index.php?title=Data%3A2014_Northwest_Territories_f...
[09:47:41] elukey: At first I was like "that doesn't seem high" then I realized it was idle percentage, not busy
[09:47:55] And err yeah, they look very not idle x)
[09:57:26] <_joe_> if kafka-main is underprovisioned, we have to cut things out
[09:57:48] <_joe_> one big relief will be when we turn off parsoid pregeneration on restbase
[10:03:55] talking about parsoid, there's something strange-ish going on
[10:04:29] We got a spike in requests last night, which has completely disappeared, but latency increased *after*, and is still high
[10:04:40] I imagine it may be reparses due to a template change?
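The busy request handler threads spotted above show up through the broker's RequestHandlerAvgIdlePercent gauge, which is what the Grafana panel linked at 09:40:13 charts. Below is a rough sketch of pulling the same signal straight from the Prometheus HTTP API; the Prometheus URL is a placeholder and the exact metric name depends on the jmx_exporter rules in use, so treat both as assumptions and copy the real expression from the dashboard.

```python
# Sketch: read the request-handler idle gauge from the Prometheus HTTP API.
# Both the Prometheus URL and the metric name are assumptions; the authoritative
# expression is the one used by the Grafana panel linked above.
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
QUERY = (
    "avg by (instance) "
    "(kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent)"  # assumed name
)

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    idle = float(result["value"][1])
    # Idle fractions well below ~0.2 suggest the request handler pool is saturated.
    print(f"{instance}: {idle:.2%} idle")
```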
[10:11:20] <_joe_> claime: uhm let me take a look
[10:11:47] There's a server missing (parse1012), it was taken out because of some CPU weirdness, I'll put it back since it hasn't alerted since
[10:11:54] But it doesn't seem to be congestion imo
[10:12:40] <_joe_> claime: we should probably take a look at the flamegraphs
[10:32:27] sorry back
[10:35:06] it makes sense now why moving partitions around makes things better
[10:35:50] kafka100[45], that have the lowest number of partitions, show ~25% of idle time, that is good
[10:36:28] if you check jumbo, that receives a ton more traffic, we go up to ~80%
[10:36:53] I think that we haven't paid attention to these metrics in the past, we should also probably have some alarms
[10:40:39] all right deploying eventgate
[10:40:41] let's see
[10:42:35] 10serviceops, 10RESTbase Sunsetting, 10Epic, 10Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (10DAlangi_WMF)
[10:44:58] jayme: o/ I see some diffs for envoy-related configs (max requests per conn etc..) in the eventgate-main's diff
[10:45:01] ok to deploy?
[10:48:34] the parsoid latency spike corresponds with an increase in parsoidcacheprewarm jobs which hasn't really evened out (spiked at 7am, gradually decreasing over time)
[10:49:03] jayme: ah I see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/935754, proceeding :)
[10:54:31] <_joe_> hnowlan: I don't think that's the issue tbh
[10:57:05] claime: codfw done, I am going to have lunch before doing eqiad, so we can see if any metric/log goes awol in the timeframe
[11:02:46] also lots of transclusion updates https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&viewPanel=27
[11:08:01] elukey: yeah, sorry. Please go ahead
[11:41:49] elukey: thanks <3
[12:18:02] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) 05In progress→03Resolved
[12:18:14] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:19:17] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert)
[12:19:34] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:19:46] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High
[13:21:15] claime: deploying to eqiad :)
[13:21:35] elukey: ack, be wary we have an outbound port saturation page rn
[13:29:53] produce rate jump
[13:30:57] may just be the restart
[13:31:15] it is yes, the same happened to codfw, should go back to normal in a bit
[13:31:43] Yeah it's just the restart, it's already gone down
[13:31:59] I always get caught out by this with eventgate
[13:36:18] it doesn't seem to have had a measurable effect on the main brokers, but the problem is definitely in the graph that we were checking this morning
[13:37:11] RequestHandlerAvgIdlePercent ?
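One way to check the partition and leader spread behind those per-broker idle-time differences (kafka100[45] versus the rest) is to read the cluster metadata directly. A minimal sketch with confluent-kafka-python follows; the bootstrap server is a placeholder, not a real kafka-main broker.

```python
# Sketch: count partitions and partition leaders per broker from cluster metadata.
# The bootstrap server is a placeholder, not a real kafka-main broker.
from collections import Counter
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-broker.example.org:9092"})
metadata = admin.list_topics(timeout=10)

leaders = Counter()
replicas = Counter()
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        leaders[partition.leader] += 1          # broker id currently leading
        for broker_id in partition.replicas:    # every broker holding a replica
            replicas[broker_id] += 1

for broker_id, broker in sorted(metadata.brokers.items()):
    print(f"broker {broker_id} ({broker.host}): "
          f"{leaders[broker_id]} leaders, {replicas[broker_id]} replicas")
```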
[13:37:26] exactly yes, there is a big imbalance
[13:37:49] and 5% is not good, all docs that I've read suggest 20/30%
[13:38:27] so we may need to rebalance partitions again
[13:38:41] to add more load to 1004/1005
[13:39:49] I suppose we have no way to actually know which topics have imbalanced partitions?
[13:42:31] the big ones should all have one partition for each broker, but the following metric would probably need to be improved
[13:42:34] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=20
[13:42:40] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-eqiad&viewPanel=48
[13:43:06] there is a ~5:1 ratio of partitions between 1001->1003 and 1004/1005
[13:43:41] every producer can send data only to the partition leader
[13:44:11] and in this case we have a ~4:1 ratio of partition leaders running on 1001/3 and 1004/5
[13:44:35] Yeah that'd explain why 1004/5 are idling while the rest is basically doing nothing
[14:10:39] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10TheresNoTime)
[16:07:03] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10bd808) Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to over come T292707" step w...
[16:14:32] 10serviceops, 10Prod-Kubernetes, 10Toolhub, 10Kubernetes, 10Patch-For-Review: Maintenance environment needed for running one-off commands - https://phabricator.wikimedia.org/T290357 (10bd808) There is now support for a REPL for MediaWiki on Kubernetes from {T341197}. This might be seen as similar to @ako...
[16:39:55] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) 05Resolved→03Open
[16:42:27] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) From comms with wikiwand: It seems User-Agent and Api-User-Agent (for client-side requests) are ignored, ca...
[16:52:04] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) `Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)` added to the list of user-agents. Please advise...
[17:21:15] 10serviceops, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10SRE, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF)
[17:48:27] 10serviceops, 10Thumbor, 10Patch-For-Review, 10Platform Team Workboards (Platform Engineering Reliability): Upgrade Thumbor to bullseye - https://phabricator.wikimedia.org/T336881 (10Ladsgroup) I understand, my point is around priorities of what to upgrade first: i.e. the order of OS upgrade roll out in th...
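Rebalancing to push more load onto 1004/1005, as discussed above, would normally go through Kafka's partition reassignment tooling, which consumes a JSON plan. The sketch below only builds such a plan for one hypothetical topic: the topic name, the partition count, the broker ids 1001-1005 and the replication factor 3 are all assumptions about the real layout, and a preferred-leader election would still be needed afterwards for leadership to follow the new replica order.

```python
# Sketch: build a kafka-reassign-partitions style JSON plan that rotates the
# preferred leader across all five brokers. Topic name, partition count and
# broker ids are hypothetical placeholders for the real kafka-main layout.
import json

brokers = [1001, 1002, 1003, 1004, 1005]
topic = "example.topic"      # placeholder topic
num_partitions = 5           # assumed: one partition per broker
replication_factor = 3

plan = {"version": 1, "partitions": []}
for partition in range(num_partitions):
    # The first replica listed becomes the preferred leader, so offsetting the
    # replica list by the partition number spreads leadership evenly.
    replicas = [
        brokers[(partition + i) % len(brokers)] for i in range(replication_factor)
    ]
    plan["partitions"].append(
        {"topic": topic, "partition": partition, "replicas": replicas}
    )

with open("reassignment.json", "w") as fh:
    json.dump(plan, fh, indent=2)

print(json.dumps(plan, indent=2))
```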
[18:39:03] Hey, I'm decommissioning dbproxy10[12-17] and they are mentioned in two helm charts: https://gerrit.wikimedia.org/g/operations/deployment-charts/+/225c3b9caad5ed7844ce1cf0c281c3cc0fda0e75/helmfile.d/services/ipoid/values-eqiad.yaml and https://gerrit.wikimedia.org/g/operations/deployment-charts/+/225c3b9caad5ed7844ce1cf0c281c3cc0fda0e75/helmfile.d/services/linkrecommendation/values.yaml
[18:39:08] we probably need to update them
[18:39:23] See T341121
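To track down every values file that still references the hosts being decommissioned, a simple scan of a local deployment-charts checkout is enough; in the sketch below the checkout path is a placeholder and the host pattern just matches dbproxy1012 through dbproxy1017 as listed above.

```python
# Sketch: flag helmfile.d values files that still mention dbproxy1012-1017.
# The checkout path is a placeholder; run it from a local deployment-charts clone.
import re
from pathlib import Path

repo = Path("deployment-charts")          # hypothetical local checkout
pattern = re.compile(r"dbproxy101[2-7]")  # hosts being decommissioned

for values_file in sorted(repo.glob("helmfile.d/**/*.yaml")):
    for lineno, line in enumerate(values_file.read_text(errors="ignore").splitlines(), 1):
        if pattern.search(line):
            print(f"{values_file}:{lineno}: {line.strip()}")
```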