[07:40:01] o/
[07:41:23] o/
[10:27:15] lunch
[12:02:16] in `event.mediawiki_cirrussearch_request` what is the story with requests with `user-agent` set to `WMF/cirrus-streaming-updater-consumer-search`?
[12:10:01] cormacparle: these are requests made by the search update pipeline, probably the process that checks the consistency of the indices
[12:29:04] 👍
[13:07:05] o/
[13:16:49] Speaking of claude.ai, what do y'all think about getting some kind of shared account? Dunno how much it would cost, but it might be nice to see each other's questions/workflows
[13:21:49] \o
[13:22:02] o/
[13:23:06] inflatador_: the workflow is pull the slot machine lever, see what comes out :P Sometimes it's useful at least, but still trying to figure out under which circumstances
[13:25:44] haven't used them for code generation much yet, I might ask for guidance once in a while when exploring a new topic
[13:26:03] ebernhardson ACK. I get a lot of mileage out of ChatGPT, although it's mostly for regexes and promQL or jq queries. I'm definitely thinking a shared instance would be nice, although maybe among SREs if not Search Platform
[13:28:13] as for a shared account, it's probably not that expensive, i think claude is $20 or $25/person for the basic setup. But i briefly looked at the software acquisition form we would have to fill out and it seems tedious
[13:30:57] Sounds like a job for g-ehel once he gets back ;P
[13:42:30] hmm, it looks like the way the new cirrus dumps in airflow are designed, they only sync the dumps if 100% of wikis are a success?
[13:43:38] no clue? but possible that it's seen as a single process for all the wikis
[13:44:25] it looks like there is a "sync" step after the section dumps, and it's marked to only run if all_success
[13:44:57] s5 failed because tlwikisource_content is missing somehow, which prevents any sync
[13:46:05] and indeed, i don't see that index on eqiad or codfw. can create it, but doesn't seem like a great idea to block all the others
[13:46:44] sure
[13:47:55] the index not being created could possibly be due to T401633
[13:47:56] T401633: UpdateSearchIndexConfig.php fails with "Named cluster (dnsdisc) is not configured for maintenance operations" - https://phabricator.wikimedia.org/T401633
[13:48:17] dcausse: hmm, we fixed that since then i guess
[13:48:34] I just fixed it this morning, Trey merged the patch
[13:49:09] dcausse: oh, actually that needs to fetch the 'managed clusters' list, not the writable clusters
[13:49:25] dcausse: that's the new thing i added to deal with this, the problem is most clusters aren't writable because SUP does the writes
[13:49:29] yes, I think we missed it because we never use --cluster all ourselves
[13:49:33] indeed
[13:51:52] i've created them now by running each cluster, but the question is...we don't really have a plan for how to fill the index. I can backfill what's available in SUP but that's probably not everything
[13:51:59] otherwise, two weeks for saneitizer i guess
[13:54:14] there's https://gitlab.wikimedia.org/repos/search-platform/cirrus-toolbox#force-rerender
[13:54:45] oh right, it's small enough that should work
[13:54:50] yes
[13:56:32] oh, i should have looked. It's an empty wiki anyways :P "Please do not start editing this new site."
[13:57:21] :)
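The all_success behaviour discussed above is Airflow's default trigger rule, which is set per downstream task. A minimal sketch of the difference, using hypothetical task names rather than the actual cirrus dumps DAG, and assuming a recent Airflow 2.x:

```python
# Hypothetical illustration of trigger rules, not the real cirrus dumps DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("cirrus_dumps_sketch", start_date=datetime(2025, 8, 1), schedule=None):
    section_dumps = [EmptyOperator(task_id=f"dump_{s}") for s in ("s1", "s5", "s8")]

    # Default trigger rule is ALL_SUCCESS: one failed section (e.g. s5 with a
    # missing index) keeps the sync from running for every wiki.
    sync_all_or_nothing = EmptyOperator(task_id="sync_all_or_nothing")

    # ALL_DONE runs once every upstream dump has finished, regardless of
    # individual failures, so the sections that did succeed still get synced.
    sync_whatever_succeeded = EmptyOperator(
        task_id="sync_whatever_succeeded",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    section_dumps >> sync_all_or_nothing
    section_dumps >> sync_whatever_succeeded
```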
[14:00:51] i feel like the dump is slower than before, but maybe it just feels that way. Maybe we also need to adjust the output to print the wiki, because i have no clue from these logs what's currently dumping...i'd just assume it's wikidata and commonswiki since they have 100M+ docs
[14:01:49] s8, which is only wikidata, was 2% on aug 8 at 11:44, 16% on aug 9 at 11:35, 30% on aug 10 at 11:11
[14:02:39] so, ~14% per day, or about 190/sec
[14:03:44] is this after switching to sort: page_id ?
[14:03:56] I was afraid of it being slightly slower
[14:04:39] hmm, at 190 docs/sec it will take 7.25 days to weekly dump wikidata :P
[14:04:45] not sure, checking
[14:04:52] :/
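The rate estimate above can be reproduced from the three progress readings; the ~117M document count used here is an assumption, roughly what the quoted 190 docs/sec figure implies rather than an authoritative index size:

```python
# Back-of-envelope check of the figures quoted above (the index size is an
# assumed round number, not an authoritative count).
from datetime import datetime

readings = [  # (timestamp, fraction complete) for the s8 dump
    (datetime(2025, 8, 8, 11, 44), 0.02),
    (datetime(2025, 8, 9, 11, 35), 0.16),
    (datetime(2025, 8, 10, 11, 11), 0.30),
]

(first_ts, first_pct), (last_ts, last_pct) = readings[0], readings[-1]
elapsed_days = (last_ts - first_ts).total_seconds() / 86400
rate_per_day = (last_pct - first_pct) / elapsed_days   # ~0.14 per day
days_for_full_dump = 1 / rate_per_day                  # ~7 days

assumed_docs = 117_000_000                             # rough wikidata index size
docs_per_sec = assumed_docs * rate_per_day / 86400     # ~190 docs/sec

print(f"{rate_per_day:.1%}/day, ~{docs_per_sec:.0f} docs/s, "
      f"full dump in ~{days_for_full_dump:.1f} days")
```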
[14:05:37] hmm, unfortunately it doesn't seem to say which version of mediawiki is being run. It started on the 8th though, so it is probably whatever deployed last thursday
[14:06:39] sort: page_id was merged aug 7th so unlikely
[14:06:58] the solution in python to make this faster (for hadoop dumps) was a whole bunch of parallelism, but that's tedious in php :P
[14:07:08] this dump script is far from optimal anyways... yes
[14:07:43] we should push the dump we take from hadoop to dumps.wmo
[14:07:59] and stop running DumpIndex.php
[14:08:51] yea
[14:08:55] hmm, i only see 9 dags at https://airflow-search.wikimedia.org/ :S
[14:09:10] do you have a filter set?
[14:09:17] doesn't look like it
[14:09:36] oh, the `running` thing is a filter
[14:09:43] now there are 44 :P
[14:10:12] seems like filters are somewhat persistent
[14:10:13] looks like the all-wikis dump finishes in under 7 hours in hadoop
[14:11:37] but indeed we should probably look into formatting the hadoop dumps, and then syncing however the new thing publishes
[14:11:41] it's a bit of work to wire all this together to publish those to dump.wm.o but could definitely have value
[14:19:26] oh, silly me. I was responding to an email asking about cirrus dumps not being available. Asked them what they use the dumps for
[14:19:37] then i visited the home page of their email domain, "The search engine that values you as a user, not as a product"
[14:19:45] so, maybe it's a bit obvious :P
[14:20:24] :)
[14:20:25] https://www.qwant.com/
[14:20:49] a european search engine, interesting at least
[14:21:19] * inflatador_ wonders how they compare to Kagi
[14:21:20] from france originally :)
[14:21:29] thought they were using bing
[14:21:41] i kinda assume they all are, but maybe they have augmented bits
[14:21:59] sure
[14:22:01] Kagi? Yeah, they buy search results from Bing and other places
[14:22:12] qwant
[14:22:46] i suppose initially i thought that was about finance quants
[14:39:51] all the linters are happy with my reindexer rewrite...now on to the test suite. Perhaps the least fun part :P
[15:23:36] https://en.wikipedia.org/w/api.php?action=help&modules=cirrus-check-sanity default is dnsdisc :/
[15:24:11] yea :S the dnsdisc thing has mucked with a few of our expectations that i didn't fully think through
[15:24:47] it also nuked some of our metrics, because everything cirrus reports goes to the 'dnsdisc' bucket
[15:24:58] well, not everything. but almost everything
[15:25:32] maybe we could have nginx slap an extra provenance header on the responses, but wasn't sure we want to thread that all through
[15:25:51] the other thought was replacing envoy with nginx so we get envoy metrics between envoy and localhost on the servers
[15:25:55] I think the prom metrics know where they come from
[15:25:57] err, replacing nginx with envoy
[15:26:18] right, but not the ones cirrus reports like percentiles, time spent per request type, etc.
[15:26:21] but well that's not 100% accurate
[15:27:04] I mean we know it's coming from prom k8s eqiad so dnsdisc is eqiad in normal conditions
[15:27:20] but totally false when we switch over
[15:27:24] yea
[15:27:44] on the upside, sre can now move traffic on brief notice...so at least something came of it :)
[15:27:57] yes... definitely nice
[15:28:13] it saved our butts a few times with the eqiad chi issues ;)
[15:59:59] Random question from today's DPE deep dive: how long is the saneitizer loop these days? I have "6 weeks" in my head, but ottomata said 2 weeks in the meeting and ebernhardson implies 2 weeks above. Looks like it might've gotten a lot faster and I forgot.
[16:00:29] workout, back in ~45
[16:06:34] Trey314159: it's a two-part loop, the main loop is 2 weeks. We visit every page every two weeks. But there is also an embedded slow-reindex every 1 in n loops. n is currently 8, so 16 weeks for a full reindex
[16:07:01] cool, thanks!
[16:08:22] (these are the saneitize-loop-duration and saneitize-rerender-frequency options of the flink updater)
[16:10:06] * ebernhardson separately is not having fun with these tests that interact with k8s...so much state to remember in my head :P
[16:26:39] dinner
[16:52:09] back
[17:35:21] heading out for coffee with u-random (we both live in SA), back in ~1h
[18:20:43] hmm, it turns out if you patch builtins.print, the debugger can't print anything anymore :P
[18:21:10] * ebernhardson should really fix this printing in various places...but that's after getting the main thing going
[18:23:18] back
[20:03:54] ebernhardson ryankemper do y'all have any interest in trying a master failover on chi eqiad today at pairing? We'd depool first OFC. If not, maybe later this week?
[20:07:07] inflatador_: yea we can probably do that
[20:09:59] Cool, we can work out of T400160. Ryan reminded me that we need to give it some time.
[20:10:00] T400160: Investigate eqiad cluster quorum failure issues - https://phabricator.wikimedia.org/T400160
[20:31:21] i'm game
[20:58:06] well that's scary. Spent the last week rewriting significant parts of the reindex orchestration. Ran it the first time in prod against just `aawiki`, and somehow it worked without failing. well mostly, it finished but never exits
[20:58:17] i'm sure i just need to add a few more wikis and it will fail though...
[21:04:24] * ebernhardson apparently forgot to tell the backfiller that no more backfills would register, so the thread just runs and nothing exits
[21:04:38] i wonder if that even needs to be a thread anymore...but for another time
[21:57:39] OK, eqiad is repooled. I haven't seen traffic patterns shift yet, so that's a bit odd
[21:57:40] https://grafana.wikimedia.org/goto/er88rb_Ng?orgId=1
[21:57:54] probably just need to wait a few minutes
[22:13:07] yup traffic's back
[22:13:28] jfyi posted a little summary here inflatador_ ebernhardson https://phabricator.wikimedia.org/T400160#11032897
[22:13:41] sorry, https://phabricator.wikimedia.org/T400160#11080565 *
[22:17:12] ryankemper ACK, great write up
[22:20:02] +1
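On the backfiller thread that never exits (21:04): that is the usual symptom of a worker blocking on a queue that is never told the producer has finished. A generic sketch of the pattern and the common sentinel fix, not the actual reindex orchestration code:

```python
# Generic illustration of the hang described above, not the reindexer itself:
# a consumer thread blocks on get() forever unless it is told that no more
# work will ever arrive (here via a sentinel object).
import queue
import threading

DONE = object()  # sentinel: "no more backfills will register"

def backfill_worker(work: "queue.Queue[object]") -> None:
    while True:
        item = work.get()          # blocks indefinitely if DONE is never enqueued
        if item is DONE:
            break
        print(f"backfilling {item}")   # stand-in for the real work

work: "queue.Queue[object]" = queue.Queue()
worker = threading.Thread(target=backfill_worker, args=(work,), name="backfiller")
worker.start()

for wiki in ("aawiki",):           # producer registers whatever backfills exist
    work.put(wiki)

work.put(DONE)                     # without this, join() never returns and the
worker.join()                      # process just sits there after the real work is done
```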