[09:10:36] o/ dcausse: I would like to get some insights on memory/CPU usage amongst SUP flink operators. I saw that we can launch the application with jemalloc but in the end, I need that information from the executing task managers. Do you know how to get that? [09:12:22] pfischer: hey [09:13:52] we disable jemalloc on purpose because it showed bad behaviors on the wdqs job but if you want to try it for the sup there should be a line in the helmfiles to comment [09:15:15] pfischer: at the end of helmfile.d/services/cirrus-streaming-updater/values.yaml simply flip the DISABLE_JEMALLOC env var [09:32:17] dcausse: Alright, I’ll give it a shot. Looking at the entrypoint script, I thought that LD_PRELOAD (which is defined if DISABLE_JEMALLOC != false) would only affect the container that launches the application (creates the graph) but not necessarily all the task managers finally executing the operators. [09:33:53] sure, haven't looked deeply into how this is passed across all startup scripts but unless it recently broke it should work, we saw a noticeable change in the mem footprint when we disabled it [09:40:11] Seems like there are less intrusive ways of sampling, for example: https://github.com/async-profiler/async-profiler (which shopify suggests as one of their go-to tools for flink applications) [09:44:27] nice! [10:32:27] lunch [13:12:15] dcausse: async profiler works, the question now is: how do we get the dumps out of the production containers? My naive approach would be a script that executes the asprof commands and uploads the dumps to, for example, my public people directory (hosting a corresponding upload_dump.php) [13:13:39] pfischer: is this something you want to automate fully or a command you run from to time?
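The DISABLE_JEMALLOC toggle discussed above presumably gates an LD_PRELOAD of libjemalloc in the container entrypoint. A minimal sketch of that logic, assuming the library path, the default value, and the exact variable semantics (the real entrypoint script may well differ):

```shell
#!/bin/sh
# Sketch only: preload jemalloc unless it has been disabled via env var.
# The library path and the default are assumptions, not the real entrypoint.
if [ "${DISABLE_JEMALLOC:-true}" = "false" ]; then
    export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
else
    unset LD_PRELOAD
fi
echo "LD_PRELOAD=${LD_PRELOAD:-<unset>}"
```

If the env var is exported into the taskmanager pods by the helm chart, flipping it in values.yaml should propagate to every JVM the entrypoint launches, which is the behavior dcausse describes.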
[13:14:08] o/ [13:14:42] o/ [13:15:01] s/from to time/from time to time/ [13:15:56] dcausse: I would start with a command to be executed manually (from time to time) but would still need the resulting files [13:17:02] I don’t have kubectl exec permissions in the cirrus-streaming-updater namespace at the moment. inflatador: Is that sth. I have to request somewhere? Do I have to be added to a group? [13:17:39] pfischer: this is something we have but you have to craft custom env vars to enable [13:18:15] if the image is properly packaged then it's just a matter of running kubectl exec [13:19:12] for fetching the file I see that "kubectl cp" exists, unsure if we have the rights to use it tho [13:19:12] `kubectl exec` is basically the same as `docker exec`, right? [13:21:12] trying to find how in my history, Erik told me how to do it a couple of months ago [13:24:06] I don't have perms to `kubectl exec` either FWIW [13:24:56] I tried it as root and it didn't work so well [13:26:30] looking at others' cmd history on deploy1002 I see `kubectl exec -it machinetranslation-production-8589859b97-mht2q /bin/bash`, let me try this again [13:27:08] pfischer: KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-eqiad.config kubectl exec -ti -n cirrus-streaming-updater flink-app-consumer-cloudelastic-taskmanager-1-1 -c flink-main-container bash [13:30:19] for copying this seems to work: kubectl cp flink-app-consumer-cloudelastic-taskmanager-1-1:/etc/hosts ./hosts -n cirrus-streaming-updater -c flink-main-container [13:30:48] will test/document that cmd [13:31:21] pfischer: then the thing I don't understand is the upload_dump.php script, what would this do? [14:14:03] dcausse: this would accept the dump file upload and store it. If I profile continuously for 10s windows, each dump could automatically be uploaded. But I’ll start with manually copied dumps for now.
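One hypothetical shape for the manual workflow pfischer describes: profile the taskmanager JVM with async-profiler inside the container, then push the resulting flame graph to the upload endpoint. The target PID, the output path, and the upload URL are all placeholders for illustration, not the actual setup:

```shell
#!/bin/sh
# Hypothetical helper run inside the taskmanager container:
# 10-second CPU profile of a JVM, then upload the HTML flame graph.
# asprof must be available in the image; URL and paths are placeholders.
profile_and_upload() {
    pid="$1"
    out="/tmp/flink-profile-$(date +%s).html"
    asprof -e cpu -d 10 -f "$out" "$pid" || return 1
    curl -sf -F "dump=@${out}" "https://people.wikimedia.org/~pfischer/upload_dump.php"
}
```

Alternatively, the `kubectl cp` invocation dcausse pasted above sidesteps the PHP endpoint entirely by pulling the dump file back to the deploy host.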
[14:18:55] sure, we might want to look at how flamegraphs are exposed in https://performance.wikimedia.org/php-profiling/ for mw and possibly use the same ideas if possible [14:39:11] Hm, would you want to reuse the existing system? I’d have to check if async-profiler supports the format expected by arc-lamp. However, async-profiler comes with built-in flame graph exports [14:44:07] yeah... I have no clue what it involves tbh, but we might want to look at how all this is automated/exposed and see if we can reuse some of these ideas [14:46:54] I skimmed the project descriptions. They use a profiler (excimer) and a collector (arc-lamp) which collects all the logs into a redis store, from which the second half of arc-lamp (arclamp-log) aggregates them into flame graphs, grouped per hour/day. [14:47:54] sounds complicated :) [14:48:30] but we also want to somehow aggregate the various taskmanagers no? [14:59:55] \o [15:00:54] just realized that new completion suggestions from intellij are a lot smarter... e.g. start writing the first letter of an exception message like 'IllegalArgumentException("U' and completing with "nknown prefix: " + prefix); [15:01:27] haven't enabled their llm plugins or anything [15:01:29] o/ [15:01:55] o/ [15:01:56] dcausse: which version of the IDE are you using? [15:02:34] pfischer: IntelliJ IDEA 2024.1 (Ultimate Edition) [15:05:17] ah I see "Jetbrains AI assistant" in the list of plugins but it says "no license" so I suppose it's doing nothing [15:07:35] disabled it and I no longer get these suggestions, must be providing some basic features of the paid version [15:13:46] I’m off for now, will be back later tonight. [15:49:21] workout, back in ~40 [16:56:25] * ebernhardson is for some reason surprised to see `all_elasticsearch_requests_cached: True` for morelike requests while reviewing request logs.
I mean it's supposed to do that, but nice to see :) [16:56:58] 80ms to serve the cached response is also not terrible [16:57:20] :/ [16:58:05] per security channel, "there have been some edits to some pretty impactful templates which are causing knock-on effects all over the place atm"...not sure if that affects that measurement [16:59:07] i'm looking at logs from 4/4 04:00, so probably not :) But that will probably cause the updaters and job queues to be busier [17:02:45] other randomly curious things, we get intitle searches from mediawiki/1.42.0-wmf.24 user agent, but to mw-api-ext. I would have thought internal requests go to -int [17:04:19] ebernhardson: I think that one is T351081 [17:04:19] T351081: CirrusSearch might log a user-agent set to MediaWiki/1.42.0-wmf.4 when the original request does not have any User-Agent set - https://phabricator.wikimedia.org/T351081 [17:04:34] oh, that would make sense [17:04:54] with wdqs we ban those clients forcing a UA set, for MW I guess we can't :/ [17:05:16] yea probably not. But we could use a better placeholder [17:05:38] true, was very surprised the first time I saw it [17:12:14] it's actually a massive % of requests that have this :S For the given hour of all web searches ~10% have the mw agent [17:14:26] yes... I remember now that this question was raised by an analyst of the Mikhail team, they were probably very surprised to see that many events [17:14:54] and they should really not considered as "internal" searches [17:14:59] *be [17:15:13] yea i was going to consider them internal until i started looking at them, they seem like a random assortment [17:15:23] yes [17:35:08] * ebernhardson apparently doesn't remember how browsers work :P [17:35:56] user clicks a result in autocomplete, Special:Search redirects them, and that generates a page view. What's the referer on the page view?
It's not special:search :P [17:36:21] that'd be too easy :) [17:37:07] indeed....i mean i'm sure i could self-join webrequests to find it...that's certainly the opposite of easy [17:41:07] lunch, back in ~40 [17:41:24] ryankemper note that I moved our pairing session back as g-ehel isn't in this week [17:41:36] * ebernhardson ponders cheating and considering the redirect a page view, instead of checking the actual web request to verify [17:43:55] dinner [18:34:41] suspicious size for a log file that doesn't seem to be appending anything new: 2147483637 [18:49:47] Administrivia: (1) standup updates if you have them, and (2) tomorrow is the Product + Tech regular meeting, so we'll join The Wednesday Meeting after that as per usual. [18:56:53] in very rough first pass numbers, autocomplete is 2% of internal page views, 0.45% of all page views. special:search is 0.17% and 0.04% respectively. For an arbitrary single hour and probably still including various error classes [19:20:54] So, autocomplete looks to be about ~11x special:search in this sample, so ~92% of searches (out of autocomplete and fulltext) are autocomplete. That's a bit higher than previous estimates (or at least my recollection of them) at ~85%. I'll try to update my internal model to ~90%. Very cool. Thanks for the details! [19:24:23] well, these aren't exactly searches, these are page views attributed to searches. So it's a slightly different metric [20:33:12] also, outliers galore. After grouping by access method realized i'm missing mobile web. So the ratio of page views above would be 6.3% and 1.3% for autocomplete, and 0.62% and 0.13% when looking only at desktop. Less than 0.00% for mobile app and mobile web [20:33:37] (expanded to looking at a full day, as well) [20:39:10] and because it's randomly curious, this claims barbados has the highest ratio, at almost 4% of all pageviews coming from autocomplete. netherlands the highest "large" country at 1.3%.
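Two quick sanity checks on numbers from this stretch of the log: the stalled log file's size sits just below the signed 32-bit maximum, which suggests a 2 GiB limit was hit rather than the file naturally stopping, and the ~11x autocomplete-to-fulltext ratio does work out to roughly 92%:

```shell
#!/bin/sh
# 2147483637 is 10 bytes short of 2^31 - 1 (2147483647), the classic
# 2 GiB signed 32-bit file-size cap -- consistent with a truncated writer.
echo $(( (1 << 31) - 1 - 2147483637 ))
# If autocomplete page views are ~11x Special:Search, autocomplete's share
# of the two combined is 11/12, i.e. ~91.7%, matching the ~92% above.
awk 'BEGIN { printf "%.1f%%\n", 100 * 11 / 12 }'
```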
Iran not really a fan, they had 12M pageviews that day but only 0.07% from autocomplete [20:57:45] inflatador: finishing up lunch, about 10m late for pairing [21:01:32] ryankemper ACK, np [22:13:01] ebernhardson: I was about to introduce a new histogram metric to the SUP for search update lag, see T328330. Right before writing to ES, I’d subtract now - update.meta.dt, which should result in a time that could be compared with cirrussearch’s backlog time, see https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1. However, once the saneitizer is running as part of the consumer, I’d need a discriminator so I [22:13:01] only meter events that originate from a wiki. Off the top of your head: Are the saneitizer-sourced UpdateEvents distinct? [22:13:02] T328330: Create SLI / SLO on Search update lag - https://phabricator.wikimedia.org/T328330 [22:20:36] We're close on the Puppet patch, just need a hash instead of an array for the persistent object https://puppet-compiler.wmflabs.org/output/1018360/3314/elastic2080.codfw.wmnet/fulldiff.html [22:24:16] ebernhardson: Looks like the lack of metadata (request ID) should suffice as a discriminator
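The update-lag measurement pfischer proposes (now minus update.meta.dt, taken right before the Elasticsearch write) can be sketched for a single event. This assumes meta.dt is an ISO-8601 UTC timestamp and GNU date is available; the field name comes from the discussion, everything else is illustrative:

```shell
#!/bin/sh
# Sketch of the proposed lag metric: seconds between now and the event's
# meta.dt. In the real SUP this would feed a histogram, not stdout.
lag_seconds() {
    event_epoch=$(date -u -d "$1" +%s)   # parse ISO-8601 meta.dt (GNU date)
    echo $(( $(date -u +%s) - event_epoch ))
}

lag_seconds "2024-04-09T13:00:00Z"
```

Per the last message above, saneitizer-sourced UpdateEvents could be excluded from this metric by checking for the missing request-ID metadata before recording the sample.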