[08:32:46] After scaling SUP down to 1 instance, two topic partitions (#2, #3) started backlogging again. Both partitions are roughly 4 GB larger than the other partitions, but all partitions grow at roughly the same rate. However, SUP was not able to consume faster than they grow, so the lag remained at a constant level.
[08:33:32] Scaling up to two replicas (with cpu: 1 each) immediately helped bring down the lag within 1h.
[09:02:00] interesting, envoy seemed to have struggled a bit with only one container
[09:03:44] pfischer: out of curiosity, you seem to have reduced mem.limits from 3g to 2g, is this intended?
[09:05:58] yes, it's back to the template's default (2g), I wanted to find the min. requirements for day-to-day use
[10:58:53] lunch
[13:11:48] o/
[13:18:47] o/
[14:02:46] Will dropping RAM cause those OOM problems to resurface?
[14:22:17] inflatador: just checked and it seems like we still get a few https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.15?id=MXZnzY4B8T_a4T-eBAYk
[14:22:27] pfischer: ^
[14:29:18] We can afford to use more memory, there's about 14 TB available on the wikikube cluster and 100 GB in our quota. I hate taking the blunt approach, but I think it would help stability
[14:31:03] agreed, I think 3g would not be particularly high tbh, all other flink jobs use at least 3g
[14:54:19] dcausse pfischer I'm filling out the application security review task for async-profiler, can you review/finish https://docs.google.com/document/d/1UoU4SO9y-LQBt0Qijz-bBZN5pbUMjegYkSR80-Zreg8/edit ? LMK when finished and I'll submit thru Phab
[14:54:49] Thanks! I'll have a look
[14:57:54] inflatador: thanks!
[15:03:31] inflatador ryankemper pfischer retro if you can make it
[15:04:22] Sure, one sec.
[15:32:51] pfischer https://scorecard.dev/ for the repo security, https://github.com/boyter/scc for the code complexity
[16:08:53] workout, back in ~40
[16:18:13] administrivia: async standup notes. i'll be preparing the usual weekly updates, so "please and thank you" as our esteemed colleague says :)
[16:20:49] i.nflatador i put a dependency item in the standup notes about the nvme(s), should i ping on the task(s)? there's the one for wdqs2024, and then i also noticed (sorry i somehow missed it earlier) that there was a different one for wdqs1014. happen to know if it's just wdqs2024 or is it both? i can also raise the question of one server vs two, to help move it along (or maybe you want to do so?)
[16:41:22] * ebernhardson can't seem to understand why my page view attribution seems to work for mobile (although the numbers are terrible vs desktop, overall autocomplete pv are ~2.1% for desktop and 0.6% for mobile)... but then my actor counting using the same base stuff is finding 2 total actors in mobile :P
[16:44:50] ebernhardson: mobile does seem to link the page directly, not Special:Search, when clicking a result, could that explain?
[16:45:31] dcausse: it should still detect, this is based on some code i added to minerva that attaches searchToken to the current page uri before it goes through, so the searchToken is in the referer
[16:45:48] we added that to start getting training data from mobile (and then i'm not sure we ever analyzed the results)
[16:46:23] oh, the answer was the billion-dollar mistake: nulls
[16:47:03] i built up a variety of simple boolean conditions in spark to produce the final filter, and wprov.startswith('foo') can be null if wprov is null, which then propagates up
[16:48:52] stumbled on this kind of thing multiple times...
[16:48:54] this says mobile doesn't like us nearly as much :P Maybe it means opportunity to improve. For a random hour, 3% of desktop users searched and visited a page, 0.8% of mobile
[16:49:11] still missing mobile fulltext, but it's probably minimal
[16:49:18] :(
[16:57:14] ebernhardson: can't remember how this works, so it's probably me not testing properly, but I can't seem to see where we add such params using minerva?
[16:57:45] dcausse: https://gerrit.wikimedia.org/g/mediawiki/extensions/MobileFrontend/+/a1d3a6ce5b90cf7ce4f696b44b25a7e5bf6e3392/src/mobile.startup/search/SearchOverlay.js#145
[16:57:55] thanks!
[16:58:54] you should be able to trigger it by visiting .m. from desktop, if it's not working for you then maybe i'm undercounting
[17:01:30] i suppose on the randomly curious side, US tops the list (of countries with >100k actors that day) at 1.6% of mobile sessions searching, while India at the bottom only has 0.2%
[17:45:25] oh I think I know why I got confused, it's adding searchToken to the current page, not the target page, or I'm still missing something :)
[17:48:01] dcausse: right, it adds it to the page the user comes from, which we then see in the referer attribute of the page view
[17:48:25] makes sense, thanks!
[17:57:25] dinner
[18:06:24] lunch, back in ~40
[18:44:04] dr0ptp4kt: what was your question about wdqs1014 vs wdqs2024? I'm a bit confused
[18:53:32] ryankemper: i was trying to see if both https://phabricator.wikimedia.org/T359456 and https://phabricator.wikimedia.org/T361216 are in the works. the former i realized existed (or maybe i was reminded of it, i don't remember) while i was looking around yesterday. i knew about the latter. and i was thinking about pinging to see if sales had gotten back about it
[18:53:46] dr0ptp4kt ryankemper so far it's just wdqs2024. It was supposed to be wdqs1024 (as it's already a graph split host) but I messed that up. Too late to change it but we can make adjustments on our end
[18:54:20] as in, we have plenty of capacity in codfw to claim 2024 as a test host, or just use it as a dedicated reload host in production if need be
[18:54:30] if we have two machines with nvmes, it would be interesting to see the differences in performance because of different clock speeds and different gen xeon architectures. but even one just to validate will be helpful.
[18:56:31] inflatador: by "so far it's just wdqs2024" do you mean that https://phabricator.wikimedia.org/T361216 was created too late to fit in this year's plan, or something else?
[18:56:50] i had a funny idea the other day, which was to try loading the graph splits in parallel on the same host. for a machine with suitable ram, i'm thinking there may not be so much cpu and disk contention that it's impossible. we *don't* want to run both graphs on the same box, but no one said we couldn't load on the same box! i'm mainly interested to see, in case we have just one nvme.
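(editor's note: a minimal, hypothetical PySpark sketch of the null-propagation trap described at [16:47:03]. The column names and wprov values are made up for illustration and are not the actual attribution job; the point is only that a boolean expression over a nullable column yields NULL rather than False, and that NULL then poisons any AND/OR it is combined into.)

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical page-view rows; only the (nullable) wprov column matters here.
df = spark.createDataFrame(
    [("Page_A", "srpw1_0"), ("Page_B", None), ("Page_C", "acrw1_2")],
    ["page_title", "wprov"],
)

# Naive condition: when wprov is NULL, the whole expression is NULL, not False,
# and the NULL silently propagates up through any combined filter.
naive = F.col("wprov").startswith("acrw")

# Guarded condition: coalesce the NULLs down to an explicit False.
guarded = F.coalesce(F.col("wprov").startswith("acrw"), F.lit(False))

df.select("page_title", naive.alias("naive"), guarded.alias("guarded")).show()
# Page_A: naive=false, guarded=false
# Page_B: naive=NULL,  guarded=false   <-- the row that quietly disappears
# Page_C: naive=true,  guarded=true
```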
[18:57:06] cause based off the tickets it looks like both nvme upgrades are in the works, although it's been weeks w/o an update so we prob want to ping r.obh to see if they've gotten responses
[18:59:08] ryankemper what I meant is that I was only expecting a single NVMe, and I thought it was going to go to wdqs1024 only... but then I thought I messed that up, and we were only going to get one for wdqs2024. But based on dr0ptp4kt's ticket, it looks like we might get both, so... yay?
[18:59:42] got it
[19:01:14] Alright, pinged rob on both tickets to see if there are any updates on the quotes
[19:01:44] ty!
[21:30:42] * ebernhardson is going to guess the numbers that say mobile web and mobile apps have the same number of autocomplete requests for an hour might be wrong
[21:31:44] or are mobile apps doing something crazy? 571k autocompletes in one hour
[21:32:02] ~= 160/s
[22:03:55] that sounds a bit bonkers
[22:05:33] mobile apps might just really like submitting extra autocompletes. They logged 338k page views against those 571k autocomplete requests
[22:05:54] (or some bot figured out how to look like mobile apps? i dunno)
[22:09:55] (err, not attributed to them, but overall page views from mobile)
[22:15:08] * inflatador has no idea what's normal or not as far as our pageviews go
[22:15:36] I guess I should look at pageviews
[22:17:22] autocomplete is actually quite busy. from the `elasticsearch percentiles` dashboard we have an operational top-level view of autocomplete requests; daily it looks to be 400-950/s to the backend (plus whatever gets absorbed by http caching)
[22:19:55] actually that's just eqiad, add 150-280/s for codfw
[22:26:49] https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&viewPanel=47 this guy?
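(editor's note: a back-of-the-envelope check of the rates quoted above; all inputs are the figures mentioned in the chat, nothing is read from the dashboards themselves.)

```python
# Mobile-apps autocomplete volume quoted at [21:31:44], converted to a rate.
mobile_app_autocompletes_per_hour = 571_000
print(f"{mobile_app_autocompletes_per_hour / 3600:.0f}/s")  # ~159/s, i.e. the quoted ~160/s

# Combined backend autocomplete rate across data centers (daily range from the
# percentiles dashboard figures quoted at [22:17:22] and [22:19:55]).
eqiad_low, eqiad_high = 400, 950
codfw_low, codfw_high = 150, 280
print(f"{eqiad_low + codfw_low}-{eqiad_high + codfw_high}/s")  # 550-1230/s
```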