[14:42:22] \o
[14:43:13] o/
[14:46:03] ouch, did not consider that reindexing would be hard on k8s apis :(
[15:21:24] wha hoppen?
[15:22:46] yes me neither :(
[15:23:12] inflatador: https://phabricator.wikimedia.org/T411871
[15:23:14] inflatador: T411871, got pinged by Janis today, it's causing some noise apparently
[15:23:15] T411871: Improve cirrus reindex orchestrator to limit its impact on k8s API response times - https://phabricator.wikimedia.org/T411871
[15:23:17] the graph is pretty telling
[15:24:05] it's not yet hurting to a point where I need to pause it, but there's probably something we could improve
[15:26:43] section extraction should be mostly working, i'll try and clean it up some more today, but then i'll be out till january. The one problem i see with it right now is that most of the newlines are "accidental", meaning they come from the html source, as opposed to being injected for block-level elements like <p>
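To make the distinction concrete, here is a minimal sketch (not the actual extraction code) of what injecting newlines for block-level elements could look like when flattening HTML to text, using Python's standard-library html.parser; the tag set, class name, and sample input are assumptions for illustration only:

```python
# Hypothetical sketch: flatten HTML to text, inserting newlines around
# block-level elements instead of relying on newlines present in the source.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "blockquote", "section",
              "h1", "h2", "h3", "h4", "h5", "h6", "table", "tr"}

class BlockAwareTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        # Collapse the runs of blank lines left by nested block elements.
        text = "".join(self.parts)
        return "\n".join(line for line in text.splitlines() if line.strip())

extractor = BlockAwareTextExtractor()
extractor.feed("<div><p>foo</p><p>bar</p></div><p>baz</p>")
print(extractor.get_text())  # foo / bar / baz on separate lines, not "foobarbaz"
```

Something along these lines would keep paragraph boundaries even when the HTML source itself contains no newlines.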
[15:27:15] the newlines come out fine, but i suspect that it only accidentally works due to something in parsoid retaining all the newlines (perhaps for wt<->html roundtrip purposes)
[15:27:52] does it mean we'll get extra new lines, or more like we'll miss possibly important ones?
[15:28:14] dcausse: well, it means the newlines have to do with where newlines were in the wikitext, as opposed to how it was rendered in html
[15:28:25] dcausse: but the wikitext has newlines between paragraphs and such
[15:28:42] ok
[15:30:56] interesting, thanks for sharing that
[15:31:36] for the reindexer, i do wonder about the fact that we do polling. The "right" way might be some sort of events listing, but a simpler way might be to slow down the polling rate (Orchestrator._step)
[15:31:58] i think i kept the step rate fairly low to speed through tiny wikis, but it is a 3 day operation. 20 extra seconds * 800 tiny wikis still isn't that much with parallelism
[15:32:27] err, step delay low, step rate is apparently a bit high
[15:32:49] my understanding is that the deployment of the mwscript and its removal is very costly
[15:33:02] oh, so it's not our polling, it's the actual submissions
[15:33:35] yes, apparently it pulls all the secrets (all namespaces)
[15:33:40] thats a little harder :S
[15:34:07] no clue why it needs this, and this could be something to be optimized
[15:34:16] if not, that's a lot harder for us :(
[15:34:32] it does sound like the fix is outside our control, other than rate limiting submissions
[15:35:16] I think they want the foreachwiki use-case to use https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Running_on_multiple_wikis_(the_safe_way)
[15:35:36] but that completely breaks the current design :/
[15:35:57] i don't think thats nearly as reliable, but we could somehow parse everything out of a few mwscript invocations
[15:36:04] i feel like the error handling would be a lot more squishy
[15:36:13] totally
[15:36:44] you'll fail & retry the whole batch of wikis
[15:36:45] i guess if they need it, we could probably find some way to batch small wikis together into one invocation, but seems meh
[15:36:55] (not a single one, but like 8 at a time or something)
[15:37:00] yes
[15:37:56] was also wondering how far we are from being able to schedule the reindex directly from python, IIRC you already implemented the logic that tracks the reindex task
[15:38:18] the report.py reports on the reindexing tasks, but i dont think i use it in the normal code yet
[15:38:33] ah ok
[15:38:43] by schedule from python, you mean invoking the _reindex api?
[15:39:54] doing most of what UpdateSearchIndexConfig is doing... except that it'll get the index config/mapping from the mw api, meaning we rewrite a bunch of code :(
[15:40:33] oh, hmm. yea possible i suppose.
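As a rough sketch of the "rate limiting submissions" option discussed above (hypothetical code, not the actual Orchestrator._step), enforcing a minimum interval between mwscript launches would bound the load on the k8s API regardless of how fast the polling loop steps:

```python
# Hypothetical sketch: space out mwscript submissions so the k8s API sees
# a bounded request rate, independent of the orchestrator's polling rate.
import time


class SubmissionThrottle:
    def __init__(self, min_interval_sec: float = 20.0):
        self.min_interval_sec = min_interval_sec
        self._last_submit = float("-inf")

    def wait(self) -> None:
        # Sleep until at least min_interval_sec has passed since the last submission.
        elapsed = time.monotonic() - self._last_submit
        if elapsed < self.min_interval_sec:
            time.sleep(self.min_interval_sec - elapsed)
        self._last_submit = time.monotonic()


# 20 extra seconds * ~800 tiny wikis is still tolerable over a ~3 day run,
# especially with parallelism.
throttle = SubmissionThrottle(min_interval_sec=20.0)
for wiki in ["aawiki", "abwiki", "acewiki"]:  # stand-in for the small-wiki list
    throttle.wait()
    print(f"would submit the mwscript reindex job for {wiki}")
```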
[15:40:41] gotta do a school run, back in 15-20
[16:03:44] back
[16:09:01] the reindexator will need some retries when talking to opensearch, got a couple failures with 503s
[16:14:01] some for cirrus, failed a couple times to fetch the opensearch version :/
[16:15:47] :S
[16:16:13] i guess we paper over that with envoy retries for readers, but it shouldn't fail simple in-DC requests
[16:16:19] curious if we have some metrics about the envoy retries
[16:16:31] hmm, there must be
[16:20:39] should be the metrics from the orchestrator itself: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-2d&to=now&timezone=utc&var-datasource=000000005&var-origin=misc&var-origin_instance=deploy2002:9631&var-destination=cloudelastic-chi-https&var-destination=cloudelastic-omega-https&var-destination=cloudelastic-psi-https
[16:21:04] but the rate is perhaps too low to be visible?
[16:21:32] connection timeouts are surprisingly high
[16:22:09] indeed it's hard to tell from the graphs, but it seems the req rate is ~0.02 req/s, and timeouts are also close to 0.02/s
[16:22:52] :/
[16:23:12] but firing off a while loop of 100 curl requests from deploy2002 doesn't hit any errors :S
[16:26:13] hm.. the mw-script namespace does not seem to export envoy metrics :/
[16:29:48] and no data on retries from mw-api-ext -> search, only eventgate/shellbox
[16:33:15] * ebernhardson notes while poking at things...there are mw-script invocations that have been running for 60 days
[16:33:52] heh, it's a shell.php
[16:34:17] 59 days, also a shell
[16:34:35] * ebernhardson was looking for a reindexer execution
[16:35:14] yes, sounds similar to what happens with jupyter notebooks :)
[16:35:48] yea pretty much. although there might be some small amount of danger with them running now-undeployed mediawiki code
[16:36:02] but a shell.php sounds a bit dangerous, it's many mw versions behind
[16:36:39] indeed. Hopefully the users simply close them out when realizing they're still open
[16:38:28] fwiw, a loop of 100 `kubectl exec ...` that runs curl http://localhost:6502 (==eqiad 9243) inside a pod currently running UpdateOneSearchIndex.php...no timeouts
[16:39:11] so it's intermittent...maybe it depends on terribly awkward things like the network path between the pod and whichever host it landed on
[16:40:32] yes...
[16:41:12] err, it was 6302, but same idea.
[16:43:21] will scan the reindexer logs a bit more, but if it always happens on the version check (which I suspect to be the first call), it could perhaps be related to how containers are loaded; they added something to make sure that envoy is ready before the script runs, but who knows, maybe it's 100% reliable
[16:43:24] *not
[16:43:30] quick school run
[16:43:48] oh that does seem plausible
[17:23:31] yes, seeing some "The service mesh is unavailable, which can lead to unexpected results." which gracefully fails, or version check errors
[17:24:17] seeing many failures (esp eqiad & codfw) because of index replication not happening :/
[17:28:03] well, 53 times for codfw over 2010 indices, 27 over 1602 for eqiad
[17:30:53] loaded my updated classes into shell.php, surprisingly seems to mostly work without failing. Except it doesn't finish a loop of 1000 docs before the pod gets killed (i assume excessive memory usage? still looking)
[17:33:54] yea, OOMKilled
[17:39:19] :/
[17:39:40] we don't usually do hundreds in one request, and the limit is 1200Mi, so likely fine in prod
[17:41:24] parse 1000 pages at once? no, I doubt it, even if it's perhaps possible from the action API?
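A minimal sketch of the retries mentioned at [16:09:01], assuming plain HTTP against the local envoy listener via the requests library; the function name, port, timeout, and backoff values are illustrative assumptions, not the reindexer's real code:

```python
# Hypothetical sketch: retry transient failures (503s, mesh not ready yet,
# connection timeouts) with exponential backoff before giving up.
import time

import requests


def get_with_retries(url: str, attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=10)
            # Retry on 5xx (e.g. a 503 while the mesh/backend is unavailable).
            if response.status_code < 500:
                return response
            last_error = RuntimeError(f"HTTP {response.status_code} from {url}")
        except requests.RequestException as exc:
            last_error = exc
        time.sleep(base_delay * (2 ** attempt))
    raise last_error


# Example: fetch the cluster root document, which includes the opensearch
# version. The localhost port stands in for the envoy listener mentioned
# above and is purely illustrative.
info = get_with_retries("http://localhost:6302/").json()
print(info.get("version", {}).get("number"))
```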
[17:42:44] perhaps worth trying patchdemo for this one?
[17:43:50] the cirrusbuilddoc should not require any specific backends but no how hard that is to use patchdemo
[17:43:56] *no clue
[17:49:40] my bad, probably not worth the effort, just looked at your patch in core and it's only search related things that you're touching
[17:56:55] heading out, have a nice week
[18:12:52] sigh...more testing and find... `foo`, `bar`, and `baz` in separate block elements turn into `foobarbaz` without spaces :( Have to do something about block quotes, maybe i need to extract html instead of text and use Remex
[18:13:01] s/block quotes/block elements/
[18:42:02] * ebernhardson starts wondering if the html bits are overkill and we should just substr the source document....but i guess for testing this will be fine
[19:34:38] oh cool, patchdemo now has support for a wiki farm. Maybe we can get cirrus going in there one of these days
[19:45:29] err, damn. SFO -> IST, nonstop leaves 6:50pm, arrives 7:00pm