[02:53:42] serviceops, MediaWiki-libs-Stats, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (lmata) >>! In T344751#9292738, @gerritbot wrote: > Change 954114 **merged** by Herron: > %%%[operations/puppet@production] profile::mediawiki:...
[08:12:03] serviceops, envoy, observability, Patch-Needs-Improvement: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (fgiunchedi)
[08:12:25] serviceops, envoy, Patch-For-Review: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 (fgiunchedi)
[12:59:26] elukey: just saw email to reschedule, sounds good.
[13:00:04] qq though: how did you run perf in our k8s?! I can maybe get a profile log output, and I can connect an inspect debugger, but i don't know how you ran perf in our k8s?
[13:00:07] or did you just do it locally?
[13:00:08] ottomata: yeah sorry I have some errands this afternoon! But we can briefly chat if you want on IRC, I posted all procedures in the task
[13:00:29] yes I ran per on the kubestage100x nodes
[13:00:37] *perf
[13:01:13] oh, so log into the nodes themselves, figure out the docker pid?
[13:01:27] correct
[13:01:36] thank you, okay i can probably work with that!
[13:02:07] was --perf-basic-prof-only-functions at all helpful?
[13:02:10] ottomata: one caveat - if you use the perf command line option, you'll see a map file generated in /tmp/etc.., but in the container's namespace
[13:02:37] so you'll have to use something like docker cp to copy it on the kubestage's root namespace
[13:02:38] k
[13:02:44] changing the pid too
[13:02:46] k
[13:02:53] (in the filename)
[13:03:28] https://phabricator.wikimedia.org/T348950#9286732
[13:04:16] https://phabricator.wikimedia.org/T348950#9290196 is also relevant, since librdkafka afaics is built by node-rdkafka with optimizations, so you may see some "unknowns" in the flame graph
[13:04:31] I was able to solve some of them with --call-graph lbr
[13:04:31] right okay
[13:05:36] thank you
[13:06:47] np!
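A rough sketch of the workflow described above (elukey's full procedure is in T348950), for running perf against a container's node process from a kubestage host. The container name "changeprop-main", the sampling rate/duration, and the FlameGraph script paths are placeholders for illustration, not values taken from the log:

    # Run on the kubestage node itself; "changeprop-main" is a hypothetical container name.
    CONTAINER=changeprop-main
    HOST_PID=$(docker inspect --format '{{.State.Pid}}' "$CONTAINER")

    # Sample the process; --call-graph lbr helped resolve frames from the optimized librdkafka build.
    sudo perf record -F 99 -p "$HOST_PID" --call-graph lbr -- sleep 60

    # With --perf-basic-prof-only-functions, node writes /tmp/perf-<pid>.map inside the
    # container's namespace, keyed by the container-side pid. Copy it out and rename it to
    # the host-side pid so perf can symbolize the JITted frames.
    CONTAINER_PID=$(docker exec "$CONTAINER" pgrep -o node)   # assumes a single node process
    docker cp "$CONTAINER:/tmp/perf-${CONTAINER_PID}.map" "/tmp/perf-${HOST_PID}.map"

    # Turn the samples into a flame graph (assumes a local checkout of the FlameGraph scripts).
    sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > changeprop.svg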
[13:34:29] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[13:34:37] serviceops, MediaWiki-libs-Stats, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (herron) Open→Resolved a: herron >>! In T344751#9293584, @lmata wrote: >>>! In T344751#9292738, @gerritbot wrote: >> Change 954114 *...
[14:05:26] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[14:20:19] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[15:26:20] elukey: lemme know if you are back and have some time, i have some flamegraphs, but i'm feeling like i got the wrong ones maybe... the profiler log looks sort of more useful though.
[15:26:45] i'm also trying to get a flame graph on staging where i can maybe figure out why GET requests take so long
[15:27:09] but erg, having a problem with staging + 'production' release deployment, something weird in the chart templates...
[15:27:11] ottomata: I am back yes
[15:27:31] got time for a batcave ? :)
[15:27:44] sure
[15:29:28] hnowlan: thanks for the changeprop reviews :)
[15:29:42] the chart refactor worked nicely in staging
[15:30:08] so it should be very easy now to activate debug logging (in case)
[15:30:15] elukey: started a huddle with you in slack
[15:30:16] (for node-rdkafka)
[15:30:29] a huddle? :D
[15:30:31] gimme 1 sec
[16:15:00] <_joe_> elukey: yeah true, slack people now use videocalls in slack
[16:15:18] <_joe_> it is subpar compared to gmeet, but you don't need to click on another browser
[16:20:12] :D
[16:20:42] i think screen sharing in slack is better than gmeet, but really whatever. easier to impromptu call your buds and look at flame graphs on slack i think :)
[16:20:50] now if only we could vidchat and screen share in IRC... :)
[16:29:49] ottomata: luckily Luca from the past added https://phabricator.wikimedia.org/P53042
[16:30:12] 964 0.1% 12.4% GC
[16:30:43] so it could very well be GC time
[16:32:05] ah ha, you see it too
[16:32:16] does changeprop mem consumption go down a little in the same way?
[16:33:24] it does yes
[16:35:31] hnowlan: fyi I deployed the new changeprop chart refactoring, basically a no-op
[16:35:48] nice!
[16:36:25] ottomata: there is the --trace_gc option to use, but probably a little hard to parse
[16:38:42] ottomata: I also have https://phabricator.wikimedia.org/P53054, which shows a similar usage (for node 10)
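The P53042 paste quoted above attributes roughly 12% of time to GC, and --trace_gc is suggested as a complement. A minimal sketch of both checks using node's built-in V8 tooling, assuming a hypothetical server.js entry point (the flags themselves are standard node/V8 options):

    # 1) V8 tick profiler: writes an isolate-*-v8.log, then summarize it; the GC row in the
    #    summary gives the kind of figure quoted above.
    node --prof --perf-basic-prof-only-functions server.js
    node --prof-process isolate-*-v8.log

    # 2) GC tracing: prints one line per collection with pause times along with the process
    #    output; verbose, as noted, but it shows whether scavenges/mark-sweeps line up with
    #    the CPU spikes.
    node --trace_gc server.js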
[16:40:16] hnowlan: if you agree I'd test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969758 in eqiad
[16:40:59] we observe how it behaves with real traffic, and what the cpu increase looks like
[16:46:21] * elukey tries
[16:49:30] elukey: go for it
[16:49:56] <3
[16:50:03] deployed, checking graphs
[16:59:18] as far as I can see, no cpu usage increase
[17:00:09] no sorry, wrong graphs sigh
[17:00:28] hmm, did the metrics agent die again :/
[17:00:53] No metrics before your deploy
[17:01:17] I saw yes, but I didn't see throttling for the prometheus exporter this time
[17:01:24] regardless, those are spicy jumps :(
[17:01:49] however, they're not throttled and we haven't increased capacity. Could we just live with it?
[17:01:57] the per-container graph seems to be the relevant one, we went from 50ms to 100ms more or less
[17:02:04] like they're a doubling somewhat similar to what we saw in staging
[17:02:10] exactly
[17:02:11] so maybe it's not related to throughput or workload
[17:02:38] the idea that Andrew raised about it being related to nodejs' GC could be a good one
[17:02:41] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) Stalled→In progress
[17:02:43] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (kamila)
[17:03:05] the real question short-term is do we see an increase in backlogs/a decrease in throughput
[17:03:50] we request 500ms of cpu and limit 1s
[17:03:58] afaics from the pod's spec
[17:04:48] I'd leave it running for a day or two, and then decide for codfw
[17:04:52] what do you think?
[17:06:40] yeah I'm okay with that if the backlog doesn't keep going up
[17:07:28] it's already recovered
[17:08:02] the interesting bit is that the network graphs show a decrease
[17:08:13] the other metrics look good
[17:08:18] exec time etc..
[17:08:57] memory decreased but usually it does that after deployments, so we'll see in the long run
[17:12:11] yeah it's pretty decent looking
[17:14:29] also posted a msg in #sre
[17:15:03] <_joe_> the network decrease might be due to librdkafka doing things more efficiently from that POV now?
[17:16:05] could be yes, but I didn't find solid proof in staging
[17:17:21] there are definitely some code paths that are not performant on node 18, my bet is the promisify() and related async polling
[17:17:31] (so how timers are handled etc..)
[17:36:46] * elukey afk!
[23:39:48] serviceops, MediaWiki-extensions-CentralAuth, Stewards-and-global-tools, WMF-JobQueue, and 2 others: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (Superpes15) This is currently happening! I'm seeing some users (also LTAs) cr...