[02:53:42] serviceops, MediaWiki-libs-Stats, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (lmata) >>! In T344751#9292738, @gerritbot wrote: > Change 954114 **merged** by Herron: > %%%[operations/puppet@production] profile::mediawiki:...
[08:12:03] serviceops, envoy, observability, Patch-Needs-Improvement: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (fgiunchedi)
[08:12:25] serviceops, envoy, Patch-For-Review: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 (fgiunchedi)
[12:59:26] elukey: just saw email to reschedule, sounds good.
[13:00:04] qq though: how did you run perf in our k8s?! I can maybe get a profile log output, and I can connect an inspect debugger, but i don't know how you ran perf in our k8s?
[13:00:07] or did you just do it locally?
[13:00:08] ottomata: yeah sorry I have some errands this afternoon! But we can briefly chat if you want on IRC, I posted all procedures in the task
[13:00:29] yes I ran per on the kubestage100x nodes
[13:00:37] *perf
[13:01:13] oh, so log into the nodes themselves, figure out the docker pid?
[13:01:27] correct
[13:01:36] thank you, okay i can probably work with that!
[13:02:07] was --perf-basic-prof-only-functions at all helpful?
[13:02:10] ottomata: one caveat - if you use the perf command line option, you'll see a map file generated in /tmp/etc.., but in the container's namespace
[13:02:37] so you'll have to use something like docker cp to copy it on the kubestage's root namespace
[13:02:38] k
[13:02:44] changing the pid too
[13:02:46] k
[13:02:53] (in the filename)
[13:03:28] https://phabricator.wikimedia.org/T348950#9286732
[13:04:16] https://phabricator.wikimedia.org/T348950#9290196 is also relevant, since librdkafka afaics is built by node-rdkafka with optimizations, so you may see some "unknowns" in the flame graph
[13:04:31] I was able to solve some of them with --call-graph lbr
[13:04:31] right okay
[13:05:36] thank you
[13:06:47] np!
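A rough sketch of the workflow described above (elukey's full procedure is in T348950), for running perf against a container's node process from a kubestage host. The container name "changeprop-main", the sampling rate/duration, and the FlameGraph script paths are placeholders for illustration, not values taken from the log:

    # Run on the kubestage node itself; "changeprop-main" is a hypothetical container name.
    CONTAINER=changeprop-main
    HOST_PID=$(docker inspect --format '{{.State.Pid}}' "$CONTAINER")

    # Sample the process; --call-graph lbr helped resolve frames from the optimized librdkafka build.
    sudo perf record -F 99 -p "$HOST_PID" --call-graph lbr -- sleep 60

    # With --perf-basic-prof-only-functions, node writes /tmp/perf-<pid>.map inside the
    # container's namespace, keyed by the container-side pid. Copy it out and rename it to
    # the host-side pid so perf can symbolize the JITted frames.
    CONTAINER_PID=$(docker exec "$CONTAINER" pgrep -o node)   # assumes a single node process
    docker cp "$CONTAINER:/tmp/perf-${CONTAINER_PID}.map" "/tmp/perf-${HOST_PID}.map"

    # Turn the samples into a flame graph (assumes a local checkout of the FlameGraph scripts).
    sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > changeprop.svg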
[13:34:29] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[13:34:37] serviceops, MediaWiki-libs-Stats, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (herron) Open→Resolved a: herron >>! In T344751#9293584, @lmata wrote: >>>! In T344751#9292738, @gerritbot wrote: >> Change 954114 *...
[14:05:26] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[14:20:19] serviceops, MediaWiki-General, MediaWiki-libs-Stats, SRE, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (herron)
[15:26:20] elukey: lemme know if you are back and have some time, i have some flamegraphs, but i'm feeling like i got the wrong ones maybe... the profiler log looks sort of more useful though.
[15:26:45] i'm also trying to get a flame graph on staging where i can maybe figure out why GET requests take so long
[15:27:09] but erg, having a problem with staging + 'production' release deployment, something weird in the chart templates...
[15:27:11] ottomata: I am back yes
[15:27:31] got time for a batcave ? :)
[15:27:44] sure
[15:29:28] hnowlan: thanks for the changeprop reviews :)
[15:29:42] the chart refactor worked nicely in staging
[15:30:08] so it should be very easy now to activate debug logging (in case)
[15:30:15] elukey: started a huddle with you in slack
[15:30:16] (for node-rdkafka)
[15:30:29] a huddle? :D
[15:30:31] gimme 1 sec
[16:15:00] <_joe_> elukey: yeah true, slack people now use videocalls in slack
[16:15:18] <_joe_> it is subpar compared to gmeet, but you don't need to click on another browser
[16:20:12] :D
[16:20:42] i think screen sharing in slack is better than gmeet, but really whatever. easier to impromptu call your buds and look at flame graphs on slack i think :)
[16:20:50] now if only we could vidchat and screen share in IRC... :)
[16:29:49] ottomata: luckily Luca from the past added https://phabricator.wikimedia.org/P53042
[16:30:12] 964 0.1% 12.4% GC
[16:30:43] so it could very well be GC time
[16:32:05] ah ha, you see it too
[16:32:16] does changeprop mem consumption go down a little in the same way?
[16:33:24] it does yes
[16:35:31] hnowlan: fyi I deployed the new changeprop chart refactoring, basically a no-op
[16:35:48] nice!
[16:36:25] ottomata: there is the --trace_gc option to use, but probably a little hard to parse
[16:38:42] ottomata: I also have https://phabricator.wikimedia.org/P53054, which shows a similar usage (for node 10)
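The P53042 paste quoted above attributes roughly 12% of time to GC, and --trace_gc is suggested as a complement. A minimal sketch of both checks using node's built-in V8 tooling, assuming a hypothetical server.js entry point (the flags themselves are standard node/V8 options):

    # 1) V8 tick profiler: writes an isolate-*-v8.log, then summarize it; the GC row in the
    #    summary gives the kind of figure quoted above.
    node --prof --perf-basic-prof-only-functions server.js
    node --prof-process isolate-*-v8.log

    # 2) GC tracing: prints one line per collection with pause times along with the process
    #    output; verbose, as noted, but it shows whether scavenges/mark-sweeps line up with
    #    the CPU spikes.
    node --trace_gc server.js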
[16:40:16] hnowlan: if you agree I'd test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969758 in eqiad
[16:40:59] we observe how it behaves with real traffic, and what the cpu increase looks like
[16:46:21] * elukey tries
[16:49:30] elukey: go for it
[16:49:56] <3
[16:50:03] deployed, checking graphs
[16:59:18] as far as I can see, no cpu usage increase
[17:00:09] no sorry, wrong graphs sigh
[17:00:28] hmm, did the metrics agent die again :/
[17:00:53] No metrics before your deploy
[17:01:17] I saw yes, but I didn't see throttling for the prometheus exporter this time
[17:01:24] regardless, those are spicy jumps :(
[17:01:49] however, they're not throttled and we haven't increased capacity. Could we just live with it?
[17:01:57] the per-container graph seems to be the relevant one, we went from 50ms to 100ms more or less
[17:02:04] like they're a doubling somewhat similar to what we saw in staging
[17:02:10] exactly
[17:02:11] so maybe it's not related to throughput or workload
[17:02:38] the idea that Andrew raised about it being related to nodejs' GC could be a good one
[17:02:41] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) Stalled→In progress
[17:02:43] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (kamila)
[17:03:05] the real question short-term is do we see an increase in backlogs/a decrease in throughput
[17:03:50] we request 500ms of cpu and limit 1s
[17:03:58] afaics from the pod's spec
[17:04:48] I'd leave it running for a day or two, and then decide for codfw
[17:04:52] what do you think?
[17:06:40] yeah I'm okay with that if the backlog doesn't keep going up
[17:07:28] it's already recovered
[17:08:02] the interesting bit is that the network graphs show a decrease
[17:08:13] the other metrics look good
[17:08:18] exec time etc..
[17:08:57] memory decreased but usually it does that after deployments, so we'll see in the long run
[17:12:11] yeah it's pretty decent looking
[17:14:29] also posted a msg in #sre
[17:15:03] <_joe_> the network decrease might be due to librdkafka doing things more efficiently from that POV now?
[17:16:05] could be yes, but I didn't find solid proof in staging
[17:17:21] there are definitely some code paths that are not performant on node 18, my bet is the promisify() and related async polling
[17:17:31] (so how timers are handled etc..)
[17:36:46] * elukey afk!
[23:39:48] serviceops, MediaWiki-extensions-CentralAuth, Stewards-and-global-tools, WMF-JobQueue, and 2 others: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (Superpes15) This is currently happening! I'm seeing some users (also LTAs) cr...