[00:06:29] Alright, talked to m.utante; we'll pair on Monday on updating the certs. Meanwhile we pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/991680 to quiet down probedown alerts like https://gerrit.wikimedia.org/r/c/operations/puppet/+/991680
[00:06:41] oops, last link should have been https://phabricator.wikimedia.org/T355278
[00:14:49] just depooled wdqs1020, it's supposedly 4 days behind on lag...
[00:15:37] inflatador: thanks, restarted the host (it had been deadlocked) but got distracted before depooling it
[00:24:53] ryankemper np. looks like wdqs1019 is super lagged too if you wanna take a look
[00:33:16] odd that both servers' lag is flat
[08:44:53] o/ I'm trying to understand how/when the SaneitizeJobs maintenance script gets executed. It is my understanding that launching the script on an arbitrary mediawiki installation would launch it in the context of that installation, e.g. testwiki, so any $config->get('key') calls would be resolved by looking for the `testwiki` override first, falling back to `default` otherwise. So how do we make sure this runs continuously for any existing wiki out there?
[08:47:32] s/any/every/
[09:52:56] pfischer: it's a big loop: foreach wiki: do saneitize($wiki); done (https://gerrit.wikimedia.org/g/operations/puppet/+/11ee874406493fc15bf416d785c6237d760b693d/modules/profile/manifests/mediawiki/maintenance/cirrussearch.pp#25)
[09:53:46] note the "foreachwiki"
[09:58:18] Thanks, that's what I was looking for!
[10:38:22] dcausse: the 2 dumps we need for the WDQS split are the main namespace and lexemes?
[10:44:21] gehel: yes
[10:46:01] gehel: thinking about this, I think I like the approach of using "the time it takes to recover". I might be inclined to also include the catch-up phase; I'll check a few numbers but it's perhaps neglible
[10:46:29] s/neglible/negligible/
[10:47:19] The worst case recovery that we measured on T241128 was ~50h. Given that we will have a lot less lag (due to optimizations in the previous steps), the catch-up is likely to be a lot shorter.
[10:47:19] T241128: EPIC: Reduce the time needed to do the initial WDQS import - https://phabricator.wikimedia.org/T241128
[10:47:57] I'm tempted to exclude it since it should have a lot of variation depending on when we start the recovery and thus how old the dumps are
[10:50:06] 50h is for 4.3 weeks, which should not happen. I think we can get a sense of the worst possible scenario and see if the value is "small" enough to be ignored
[10:53:57] lunch
[13:37:20] Trey314159: search standup notes mention ongoing refactoring. Do you have a link to the CR / MR?
[13:40:22] dcausse: would it make sense to copy your standup notes to the task (T355040)?
[13:40:23] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs - https://phabricator.wikimedia.org/T355040
[13:40:38] gehel: sure, I can
[13:40:45] thanks!
[14:15:59] o/
[15:05:36] small CR to bring some cloudelastic hosts online if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991788
[15:06:08] gehel: the refactoring is all local to my machine at the moment
[15:08:16] wait up on that patch... my cherry-picking failed
[15:16:20] Trey314159: ack
[15:17:35] OK, I think I got it this time...
[15:59:59] \o
[16:01:53] o/
[16:23:07] One more CR to (hopefully) fix TLS on the new cloudelastic hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991797
[16:57:15] So I'm looking over T353427, ConsumerApplicationIT should fail when the update request payload changed. I'm not entirely sure what we should be asserting though. Should it record the requests sent to elastic and verify they are constant?
[16:57:16] T353427: ConsumerApplicationIT should fail when the update request payload changed - https://phabricator.wikimedia.org/T353427
[16:59:27] unsure, I think it's surprising to have that test not fail. Ideally, I guess, we should have something we have in cirrus so that it's easier to review what the actual update request is and see changes during review in the diff view
[17:00:03] we should have something *like* we have in cirrus
[17:01:00] dcausse: ok, that makes sense. I can wire that up. I suppose the other alternative would be querying pages out of elastic and verifying the stored doc matches what we expect. Maybe do both?
[17:01:32] oh, actually we already do that second part iiuc
[17:20:59] Thomas from Data Eng made a cool alerts dashboard https://dpe-alerts-dashboard.toolforge.org/
[17:22:53] nifty
[17:23:36] * ebernhardson is mildly surprised at the lack of javadoc in wiremock
[17:26:12] lunch/dr appt, back in ~3h
[19:12:30] * ebernhardson gives up on working with wiremock... too tedious. Easier is going to be capturing UpdateRequest objects in UpdateElasticsearchEmitterTest and using toXContent to get json
[19:13:16] I guess it's not really running the full pipeline though; will have to ponder how much this differs from doing the more end-to-end recording
[19:49:50] * ebernhardson now realizes that, of course, toXContent is only the json body and doesn't include the bits that would have been the url or the query parameters :P
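A minimal sketch of the toXContent approach described above, assuming the Elasticsearch 7.x Java client (the class name, index name, doc id, and field are illustrative, not taken from the actual UpdateElasticsearchEmitterTest). As noted at [19:49:50], this captures only the request body; the index/doc id that would form the URL path, and query parameters like retry_on_conflict, are not part of it.

    import java.util.Map;

    import org.elasticsearch.action.update.UpdateRequest;
    import org.elasticsearch.common.Strings;
    import org.elasticsearch.common.xcontent.ToXContent;
    import org.elasticsearch.common.xcontent.XContentBuilder;
    import org.elasticsearch.common.xcontent.XContentFactory;

    public class UpdateRequestJsonSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical update request; index name and doc id are made up.
            UpdateRequest request = new UpdateRequest("testwiki_content", "42")
                .doc(Map.of("title", "Example"))
                .docAsUpsert(true);

            try (XContentBuilder builder = XContentFactory.jsonBuilder()) {
                request.toXContent(builder, ToXContent.EMPTY_PARAMS);
                // Prints only the body, e.g. {"doc":{"title":"Example"},"doc_as_upsert":true};
                // the URL path and query parameters are not included.
                System.out.println(Strings.toString(builder));
            }
        }
    }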
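Two possible ways around the file-resolution problems above, sketched against WireMock 2.x's standard Java API (the directory layout, file names, and endpoints are assumptions, not the project's actual test setup):

    import java.nio.file.Files;
    import java.nio.file.Path;

    import com.github.tomakehurst.wiremock.WireMockServer;

    import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
    import static com.github.tomakehurst.wiremock.client.WireMock.post;
    import static com.github.tomakehurst.wiremock.client.WireMock.urlPathEqualTo;
    import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.options;

    public class WireMockFileSketch {
        public static void main(String[] args) throws Exception {
            // Option 1: read files from the source tree instead of the classpath,
            // so stale copies under target/ are never served.
            WireMockServer server = new WireMockServer(
                options().dynamicPort().usingFilesUnderDirectory("src/test/resources/wiremock"));
            server.start();
            // withBodyFile() resolves relative to the __files subdirectory of the root above.
            server.stubFor(post(urlPathEqualTo("/_bulk"))
                .willReturn(aResponse().withStatus(200).withBodyFile("bulk-response.json")));

            // Option 2: sidestep WireMock's file lookup entirely by reading the
            // exact file yourself and inlining its content as the stub body.
            String body = Files.readString(
                Path.of("src/test/resources/wiremock/__files/search-response.json"));
            server.stubFor(post(urlPathEqualTo("/_search"))
                .willReturn(aResponse().withStatus(200).withBody(body)));

            server.stop();
        }
    }

Reading from the source tree (or inlining the file yourself) means an edited JSON fixture takes effect without a rebuild, avoiding the stale target/ copies described at [22:28:58].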