[06:37:21] <_joe_> can someone with some basic understanding of how kubernetes deployments work and how mediawiki works take a look at https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes/How_it_works and leave me some feedback?
[06:37:47] <_joe_> the goal of that page is to allow any SRE to understand how the whole thing is built and eventually modify it if they need to
[07:43:03] _joe_: I may not be that person, but it seems reasonably clear to me; though you say the php7.4-cli image installs excimer and that php7.4-fpm-multiversion-base also does so? and "we don't need a rolling restart of the pods" appears twice in the mediawiki-mcrouter section, once seemingly in the middle of a link
[08:06:46] <_joe_> Please be bold and fix the obvious mistakes :)
[08:07:00] <_joe_> but yeah on the excimer thing, I need to clarify it
[08:24:44] <_joe_> dcausse: can I ask you for a +1 for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939702/ ?
[08:24:59] <_joe_> would you be ok with deploying this change today in case?
[08:26:13] _joe_: sure, looking
[08:26:30] <_joe_> it's the move of rdf-streaming-updated to use the k8s api
[08:26:42] <_joe_> s/ed/er/ :)
[08:29:32] _joe_: thanks! I'll take care of the rdf-streaming-updater deploy once ready, no problem
[08:29:46] <_joe_> oh great :)
[08:29:53] <_joe_> yeah I need to do some prep first
[08:40:32] <_joe_> dcausse: at your earliest convenience, we can merge the change
[08:41:30] _joe_: please merge you want
[08:41:35] *when
[08:50:07] _joe_: I forgot about the test we run in dse-k8s to test the new flink deployment model (https://gerrit.wikimedia.org/r/940870)
[08:50:23] <_joe_> ah
[08:50:27] <_joe_> sorry I wasn't aware :)
[08:50:31] my bad
[08:53:47] <_joe_> dcausse: I +1'd it
[08:54:32] _joe_: thanks! do I need to wait for the change to propagate or can I go ahead?
[08:54:54] I'm seeking a (straightforward) +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/940868 if someone has two minutes
[08:55:04] <_joe_> dcausse: what do you mean?
[08:55:54] <_joe_> we have already set up the puppet part, let me make sure of it
[08:56:46] <_joe_> dcausse: yeah go ahead at your earliest convenience :)
[08:56:57] _joe_: thanks! deploying
[08:58:47] cheers _joe_
[08:59:07] <_joe_> godog: hehe I was about to tell you I did give you the +1
[08:59:39] yeah! looking forward to wrapping this up
[09:13:17] <_joe_> dcausse: https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&refresh=30s&var-dc=codfw%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main&var-container_name=All requests flowing to codfw as well
[09:13:39] <_joe_> so we're now serving rdf-streaming-updater more efficiently :)
[09:13:47] yes seems to work well, latencies dropping as well, thanks! :)
[09:15:16] <_joe_> cool :)
[09:17:21] will deploy the change in dse-k8s and we should be good
[09:18:06] <_joe_> great, thanks again for the help :)
[09:19:14] np!
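For readers following along without the Gerrit change open: the patch being discussed (939702) moves rdf-streaming-updater's MediaWiki API traffic onto the dc-local mw-api-int service through the envoy service mesh. Below is a minimal, hypothetical values.yaml sketch of the general shape such a deployment-charts change takes; the key names (`mesh`, `discovery.listeners`, `mwApiEndpoint`) and the local listener port are illustrative assumptions, not copied from the actual patch. The `mw-api-int-async-ro` destination name does appear in the Grafana dashboard linked later in the log.

```yaml
# Hypothetical values.yaml fragment for the rdf-streaming-updater release.
# Key names and the port are illustrative, not taken from Gerrit 939702.
mesh:
  enabled: true            # run the envoy sidecar in each pod
discovery:
  listeners:
    - mw-api-int-async-ro  # dc-local MediaWiki internal API (read-only/async profile)
app:
  config:
    # The job now calls the sidecar's local listener; envoy forwards the
    # request to mw-api-int in the same datacenter instead of crossing DCs.
    mwApiEndpoint: http://localhost:6500/w/api.php
```

The point of routing through a local envoy listener rather than a shared cross-DC endpoint is exactly what the latency discussion later in the log shows: each cluster's updater talks to the MediaWiki API in its own datacenter.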
[09:38:24] can't see envoy telemetry from k8s-dse but looking at the envoy config deployed via configmaps I see the new endpoint being used, going to assume that it's working as expected but please let me know if not
[09:43:56] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=flink-session-cluster-taskmanager&var-destination=mw-api-int-async-ro&var-destination=mwapi-async&viewPanel=6 interesting
[09:44:38] <_joe_> p90 plummeted but p99 went up
[09:58:31] not sure I understand the p99s but from my end this is way better overall
[10:01:18] <_joe_> I would say that some specific requests are slower than they were before, while on average you get the advantage of being dc-local
[10:02:19] <_joe_> dcausse: eqiad's situation is less rosy
[10:02:29] <_joe_> so I'll take a better look at what's going on
[10:03:32] thanks!
[10:15:59] <_joe_> dcausse: are the rdf-streaming-updaters in codfw and eqiad listening to the same topics?
[10:16:10] <_joe_> and each updating the local WDQS cluster?
[10:21:07] _joe_: yes they do exactly the same thing
[10:21:26] <_joe_> ok then we have some checks to perform I guess
[10:22:25] <_joe_> dcausse: ahhh snap, in eqiad we have twice the requests because of the dse cluster
[10:22:44] <_joe_> which might in turn have some limitations of its own
[10:24:16] I can stop our test running from dse-k8s if this helps
[10:24:21] <_joe_> but most importantly, seems like it's causing some cpu throttling in the eqiad mw-api-int cluster
[10:24:28] <_joe_> dcausse: nah it's ok
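The CPU throttling _joe_ spots at the end is the kind of thing the standard cAdvisor metrics expose on any Kubernetes cluster: `container_cpu_cfs_throttled_periods_total` counts CFS scheduling periods in which a container hit its CPU limit. As a sketch of how one might watch for it, here is a minimal hypothetical Prometheus alerting rule; the rule/alert names, namespace label, and 25% threshold are assumptions for illustration, not WMF's actual alerting config.

```yaml
# Hypothetical Prometheus rule: fraction of CFS periods in which containers
# in the mw-api-int namespace were throttled, averaged over 5 minutes.
groups:
  - name: mw-api-int-saturation
    rules:
      - alert: MwApiIntCpuThrottling
        expr: |
          sum(rate(container_cpu_cfs_throttled_periods_total{namespace="mw-api-int"}[5m]))
            /
          sum(rate(container_cpu_cfs_periods_total{namespace="mw-api-int"}[5m]))
            > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "mw-api-int pods are being CPU-throttled"
```

Throttling like this also fits the p90/p99 pattern discussed above: most requests get faster once they stay dc-local, but requests that land on a throttled pod stall for whole scheduling periods, dragging the tail (p99) up even as the median and p90 drop.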