[08:06:49] o/ dcausse: are you around? I’d like to enable another page_rerender producer: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/989442 the backport window is already open but we can use a later one too
[08:25:38] pfischer: am around, just +1ed your patch
[08:28:55] pfischer: I can deploy I guess
[09:02:00] Was AFK, thanks!
[09:03:11] Here’s the related SUP chart deployment patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/989443
[10:14:59] hm... might have underestimated the time to run 50k SPARQL queries against the same endpoint... with 5 concurrent queries it's been running for 13 hours :/
[10:15:43] load is at 25 so not sure I can increase concurrency
[10:16:38] Load = CPU utilisation (%)?
[10:18:09] load as in CPU load, which I think can be high on IO contention as well
[10:19:13] 14 partitions out of 100 done so far, it's never going to end :/
[10:22:04] Oh, that’s indeed slow. What would be the risk of increasing load? Is this running against a production (k8s) cluster?
[10:23:56] it's a dedicated machine for this test so the risk is pretty small, just that it might get slower past a certain concurrency level
[10:51:30] Can I support you in any way?
[10:58:18] pfischer: I don't think so but thanks for the offer, I'm pretty much exploring options at the moment, e.g. let it run on smaller chunks and automate all this and just wait for several days for it to finish, or perhaps consider smaller samples (5k instead of 50k)...
[10:58:32] might be a topic to discuss in our Wednesday meeting
[11:26:15] lunch
[13:38:58] got results for pywikibot: out of the 10k total queries, 9885 succeeded on both endpoints, 9091 (91%) give exactly the same results, 9337 (94%) match with varying order. Need to look manually at the differences, but glancing at 4 examples it seems to be queries with "LIMIT 1000 OFFSET 1538000", which can't return stable results without a sorting criterion I guess
[14:18:43] some of the queries are from AuthorBot (T311301) which are likely to break
[14:18:46] T311301: What's in a name? - AuthorBot: Process and Progress - https://phabricator.wikimedia.org/T311301
[14:18:49] o/
[14:18:57] O/
[15:03:34] pfischer: we're in pairing if you wanna join
[15:38:50] hello! I'm currently working on migrating jobqueue jobs to k8s jobrunners, and the next candidate on the list is cirrusSearchLinksUpdate. I'd heard mention that there might be plans for deprecating this job in favour of Flink - is that still the plan? And if so, is there a timeline?
[15:51:07] gehel: unsure if it's just me but it seems like the office hour calendar event is not there
[15:55:16] hnowlan: yes, this is still the plan, pfischer might have more details on the rollout
[15:57:06] just realized that it is only in the staff calendar. Here is the link: meet.google.com/vgj-bbeb-uyi
[16:21:15] hnowlan: That’s right, we want to replace those jobs with a stream processing solution. This will happen gradually during this quarter.
[16:21:42] pfischer: good to know, thank you!
[16:29:14] workout, back in ~40
[17:09:06] back
[17:47:31] My laptop is fixed, hooray
[18:06:48] dinner
[18:16:06] inflatador: 🎉
[18:16:10] dinner
[18:40:58] Ticket for enabling compaction on kafka-main: https://phabricator.wikimedia.org/T354794
[18:42:38] lunch, back in ~30
[19:39:52] inflatador: good news on the laptop!
[19:56:12] * ebernhardson wonders how much of the original laptop is left :)
[19:59:33] I doubt much of it. I'll be sad if I lose my Wikipedia sticker though, I didn't back that up
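An aside on the endpoint comparison described at 13:38:58: below is a minimal sketch, assuming SPARQLWrapper and placeholder endpoint URLs, of how per-query results from two SPARQL endpoints could be compared both exactly and ignoring row order. It is illustrative only, not the actual test harness; the point is that queries paging with LIMIT/OFFSET but no ORDER BY can return the same rows in a different order, so both comparisons are useful.

```python
# Minimal sketch (not the actual test harness) of the comparison described at
# 13:38:58: check whether two SPARQL endpoints return the same rows for a
# query, first exactly, then ignoring row order, since LIMIT/OFFSET without
# ORDER BY gives no stable ordering. Endpoint URLs are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON


def fetch_bindings(endpoint_url: str, query: str) -> list:
    """Run one SELECT query and return its result bindings (list of dicts)."""
    client = SPARQLWrapper(endpoint_url)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


def normalise(rows: list) -> list:
    """Turn each row into a sorted (variable, value) tuple so the rows can be
    compared as an unordered multiset."""
    return sorted(
        tuple(sorted((var, binding.get("value")) for var, binding in row.items()))
        for row in rows
    )


def compare(query: str, old_endpoint: str, new_endpoint: str) -> str:
    old = fetch_bindings(old_endpoint, query)
    new = fetch_bindings(new_endpoint, query)
    if old == new:
        return "identical"
    if normalise(old) == normalise(new):
        return "same rows, different order"
    return "different results"
```

The "identical" and "same rows, different order" outcomes correspond to the "exactly the same results" and "varying order" buckets reported above.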
[20:31:17] * ebernhardson wonders what he broke moving NetworkSessionProvider to its own extension. It worked in cirrus but now it doesn't :P
[21:59:01] Just got a "search is too busy, try again later" on Wikitech :O
[21:59:23] I guess that's me? https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=now-3h&to=now&viewPanel=25
[22:03:57] inflatador: hmm, would have to check the timestamp in logstash for a better message
[22:04:12] typically that's a pool counter rejection, not sure if there are other ways
[22:08:36] ebernhardson: no worries, looks isolated to me AFAIK
[22:08:42] it does look like the pool counter rejected some things at 21:40, 21:50 and 21:58
[22:15:02] ebernhardson: do you think this is worth investigating? In pairing w/ Ryan but can shift gears if it looks serious
[22:17:07] inflatador: nah, doesn't look serious. I kinda wish we had a dashboard about how full the pool counter is to make better decisions, but pool counter doesn't have any per-key metrics
[22:38:17] ebernhardson: is it just the saneitizer running? the timings of the spikes in requests make it look like that, but I forget how often the saneitizer runs these days https://grafana-rw.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?forceLogin&orgId=1&from=1704368725769&to=1704925553216
[22:43:19] ryankemper: hmm, I know we disable the pool counter in certain areas, checking if we do it there
[22:46:52] hmm, no, it's only for maintenance scripts that we disable the pool counters, so they are running. It really doesn't do that many queries though, it fetches in large bulk requests. I suppose if enough checker jobs are running in parallel, each takes a spot. Per the jobqueue job dashboard, those have a concurrency of around 50 jobs at a time.
[22:46:57] so, maybe?
[22:49:52] hmm, doesn't really line up though. rejections at 21:40, 21:50 and 21:58, but the cirrusSearchCheckerJob ran 20:11-20:50 and then 22:11-now
[23:34:39] lol, been trying to track down for like 2 hours why autocreation doesn't work... it's because I told it the only rights the user can have are ['read'], and createaccount isn't in that list :P
[23:38:24] have a better understanding of how account auto creation happens now, at least
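For the cross-check done by hand at 22:49:52 (do the pool counter rejections fall inside the cirrusSearchCheckerJob run windows?), a throwaway sketch follows; the times are copied from the conversation above, the open-ended "22:11-now" window is approximated as end of day, and everything else is illustrative.

```python
# Throwaway sketch: do the rejection times fall inside any checker-job run
# window? Times copied from the conversation above; date and timezone are
# ignored and the open-ended second window ("22:11-now") is capped at 23:59.
from datetime import time

rejections = [time(21, 40), time(21, 50), time(21, 58)]
job_windows = [(time(20, 11), time(20, 50)), (time(22, 11), time(23, 59))]

for rejection in rejections:
    inside = any(start <= rejection <= end for start, end in job_windows)
    print(rejection, "inside a checker-job window" if inside else "outside all windows")
```

Run as-is, all three rejections fall outside both windows, matching the "doesn't really line up" conclusion above.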