[07:22:43] plenty of pool counter rejections this night, most probably caused by a bot hitting codfw...
[09:49:17] dcausse: I'm closing tickets and I see T316496. It says that the lag is reported on the UI (https://commons-query.wikimedia.org/), but I'm not sure where that is. I'm probably blind, but could you point me in the right direction?
[09:49:17] T316496: WCQS does not report proper lag information - https://phabricator.wikimedia.org/T316496
[09:50:37] lunch
[09:51:36] gehel: it's the small reload icon bottom right of the SPARQL input query
[09:51:44] should be green
[09:51:54] if you hover it you'll get the actual value
[09:53:34] Oh right!
[09:55:11] ryankemper, inflatador: I'm re-opening T316728, it looks like elastic1048-1052 are still listed in site.pp
[09:55:11] T316728: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728
[10:05:50] lunch
[10:21:46] ebernhardson: related to T316712, I don't see any "memory_issue" on https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=9&from=now-90d&to=now&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[10:21:47] T316712: monitor circuit breaker exceptions in elasticsearch - https://phabricator.wikimedia.org/T316712
[10:22:52] Is that expected because we did not have any circuit breaker exceptions since https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/830239 was merged? Or is there something missing? Is there a way to force an exception to test the reporting chain?
[10:33:55] inflatador: regarding T313855, could you confirm that you have received additional alerts since that patch was merged?
[10:33:55] T313855: Ensure we get alerts for WDQS - https://phabricator.wikimedia.org/T313855
[10:39:24] dcausse: how can I validate T316028?
[10:39:25] T316028: Run the rdf-streaming-updater from k8s@codfw - https://phabricator.wikimedia.org/T316028
[10:42:05] inflatador, ryankemper: T294806 seems to be close to done. Could you double-check if we have more hosts that need to be decommissioned as part of this work? Or can we close it once T316728 is addressed?
[10:42:06] T316728: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728
[10:42:06] T294806: [Epic] Search platform - Hardware requests for 2021-2022 - https://phabricator.wikimedia.org/T294806
[10:42:49] Good day search people! We on the wikibase.cloud team at WMDE are starting to put some more focus into our elasticsearch/cirrus setup; we were wondering if you had (or perhaps in exchange for a bribe might even be able to deliver ;) ) some rough onboarding type presentation/notes etc.. Does anything like that exist?
[10:46:41] tarrow: let's discuss how much you're willing to bribe us first!
[10:48:01] :D hehe - I can see if there might be something at the organisational level; at the personal level I could certainly stretch to some rounds of fizzy drinks next time we're all in the same room
[10:49:22] tarrow: https://docs.google.com/presentation/d/1adXzJzpVuAwURiiPoazHq5ndnpMoNiYh9MXIE33R1K4/edit?usp=sharing is probably about the best I can do. There are various docs on wikitech.wm.o and mediawiki.org, but I don't think we have much in terms of onboarding.
[10:50:39] Thanks!
[10:50:46] That's an excellent start
[10:51:28] tarrow: there are a few recordings that were really meant to be internal to our team only (so I'm not going to share them with the whole internet). But if you give me your email, I'll share them with you.
[10:52:48] Note that we are starting work on modernizing our Search Update Pipeline. The concepts will stay similar, but the implementation will evolve greatly over the next few quarters.
[10:53:34] wonderful, that sounds great
[10:55:09] Is the Search Update Pipeline going to be "similar" to the streaming updater for the Query Service?
[10:56:07] yep
[10:56:36] at least in the sense that it will be (mostly) streaming and based on Flink
[10:56:48] yeah, that's what I was thinking
[10:57:10] tarrow: it might make sense to spend an hour with your team and a few people on our side to give you an overview. Let me know if that's useful.
[10:57:40] If you had the time that could be super useful
[10:57:57] a very big overview, or specifically on the streaming updater?
[10:59:01] whatever you need
[11:00:42] lunch, back later
[11:01:38] bon appetit
[13:20:31] gehel will look into ^^ , haven't heard from service ops yet but I'm about to email them
[13:21:51] inflatador: those servers are already offline, so I'm not sure you need to call on other ops. site.pp just needs to be cleaned (it should have been done before sending a decommission task to DC-Ops).
[13:22:21] gehel I don't need them for that ticket, just because I'm visiting their team this wk ;)
[13:22:52] Oh, right
[13:23:02] 2 different things!
[13:23:18] Have you received invites to their team's meetings?
[13:24:17] Only for a single mtg, and it was cancelled already
[13:24:44] You should ping Alex directly on IRC and see how things are going. Unless you want me to be the messenger
[13:25:34] No, I'll ping him shortly, trying to fix something on my Mac first
[13:25:53] Ok. Let me know what happens
[14:32:40] gehel: for T313855 i expect yes, those are through statsd which creates stats on-demand (rather than registering 0's like prometheus).
the related prometheus metric is also 0:
[14:32:41] T313855: Ensure we get alerts for WDQS - https://phabricator.wikimedia.org/T313855
[14:32:43] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22eqiad%20prometheus%2Fops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22topk(10,%20rate(elasticsearch_breakers_tripped%5B1h%5D))%22,%22range%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-30d%22,%22to%22:%22now%22%7D%7D
[14:34:14] ebernhardson: ok, looks good enough to me! (not sure why I did not see those errors on cloudelastic...)
[14:35:32] FYI, Flink's page mentions a license change for 'akka', does this affect us? https://flink.apache.org/news/2022/09/08/akka-license-change.html
[14:35:36] ebernhardson: I assume you're talking about T316712? not the WDQS one
[14:35:37] T316712: monitor circuit breaker exceptions in elasticsearch - https://phabricator.wikimedia.org/T316712
[14:36:23] inflatador: not really, we'll let Flink deal with the mess, but they seem to only have a limited dependency on akka, so they'll probably replace it with something else.
[14:36:50] gehel cool, only one hit for 'akka' in the puppet repo and it's not us ;)
[14:37:39] gehel: doh, yea copied the wrong T :) it's the '712
[14:39:52] ok, closing it!
[14:39:54] thanks!
[14:58:02] reindex still in progress :P takes a while
[14:58:25] somewhere between srwiki and tawiki depending on cluster
[16:05:19] meh, was asked to do a fireside chat for a search class to which i said yes... but now they want a bio to include and i'm terrible at writing :P
[16:14:51] I couldn't find a "backlog" column, moved this one to "Ops/SRE" and took off the "Current work" tag... LMK if I need to change anything else https://phabricator.wikimedia.org/T303011
[16:36:48] getting float precision errors on SearcherTest, tempted to use assertEqualsWithDelta
[16:53:05] errand
[16:58:17] dcausse: i wrote a patch for that, i think.
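[Editor's note: a minimal Python sketch of the statsd-vs-prometheus behavior described above, with made-up class names that do not correspond to either real client library. Statsd-style stats only come into existence when the first event is reported, so an alert on them may never see data; prometheus-style counters are registered at 0 up front, which is why the `elasticsearch_breakers_tripped` metric can legitimately read 0 rather than be absent.]

```python
# Hypothetical sketch, NOT the real statsd or prometheus client APIs.
# It illustrates why a statsd stat is missing until the first event,
# while a prometheus counter reports 0 from the moment it is registered.

class StatsdStyle:
    """Counters are created lazily, on first increment."""
    def __init__(self):
        self.counters = {}

    def increment(self, name, value=1):
        self.counters[name] = self.counters.get(name, 0) + value

    def get(self, name):
        # An absent counter means "no data at all", not zero.
        return self.counters.get(name)


class PrometheusStyle:
    """Counters are registered up front and start at 0."""
    def __init__(self, names):
        self.counters = {name: 0 for name in names}

    def increment(self, name, value=1):
        self.counters[name] += value

    def get(self, name):
        return self.counters.get(name)


statsd = StatsdStyle()
prom = PrometheusStyle(["breakers_tripped"])

# Before any circuit breaker has ever tripped:
print(statsd.get("breakers_tripped"))  # None: the stat does not exist yet
print(prom.get("breakers_tripped"))    # 0: registered, just never incremented
```

This is also why a dashboard panel built on the statsd-derived metric can show nothing at all when no exceptions have occurred, while the prometheus query in the Grafana explore link above still returns a flat 0.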
sec
[16:59:15] hmm, that patch was merged so i guess i didn't fix it :P
[17:09:12] ebernhardson: yes, apparently not, I think it's because we only normalize when writing the fixture not when comparing
[17:10:10] or perhaps we should compare the json output not the array
[17:12:33] comparing the encoded json works
[17:12:46] hmm, doesn't it use assertFileContains? We should probably prefer that
[17:13:19] that helper also handles writing the fixture on rebuild, which means it gets the same processing
[17:13:53] hm writing the expected file is guarded with a strange if ( is_string( $expected ) )
[17:13:57] looking
[17:14:24] dcausse: do you have old code? is_string($expected) was previously the marker that we are doing a fixture rebuild (fixture contains the filename instead of the content)
[17:14:37] I think I rebased, looking
[17:14:46] but i thought i killed those with the update to use the generic fixture handling in SearcherTest. Maybe i missed something
[17:17:06] bah, my bad sorry, thought I rebased but must have messed up something with git, it works now :)
[17:18:59] dcausse: reminds me, there are some followup patches in gerrit for that one, i had to add a BC shim to make WikibaseCirrusSearch work since it changed the return type from Searcher, would be good to merge those and get the bc code removed
[17:19:24] oh missed those, looking
[17:19:36] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/832376
[17:20:01] that one changes WBCS to use the new one, and then there are a patch each for cirrus and wbcs to remove the bc stuff
[17:34:20] lunch, back in ~45
[17:42:06] wrote a quick thing to check on the new field existing in aliased content|general|file indices, the usual suspects failed .... eqiad: enwiki_(content|general), commonswiki_file.
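[Editor's note: a toy Python illustration of the fixture fix discussed above — "we only normalize when writing the fixture, not when comparing", so the remedy is to push both the stored fixture and the freshly computed result through the same encode path. The `normalize`/`encode` helpers are invented stand-ins for whatever processing the real fixture writer applies; the actual tests are PHPUnit, not Python.]

```python
import json


def normalize(value):
    # Stand-in for the processing applied when the fixture is written
    # (here: round floats to 6 places; the real normalization differs).
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def encode(value):
    # Both sides of the comparison go through the same serializer,
    # so float formatting and key ordering can never disagree.
    return json.dumps(normalize(value), sort_keys=True)


computed = {"score": 0.1 + 0.2}           # 0.30000000000000004
fixture_on_disk = encode({"score": 0.3})  # what the rebuild step wrote

# Comparing the decoded structures directly fails on float precision:
assert {"score": 0.1 + 0.2} != {"score": 0.3}

# Comparing through the shared encode path succeeds:
assert encode(computed) == fixture_on_disk
```

This mirrors the reasoning behind preferring the `assertFileContains` helper: because it also writes the fixture on rebuild, the expected and actual values are guaranteed to receive the same processing.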
codfw: commonswiki_(file|general), enwiki(|books|news)_(content|general), cloudelastic: enwiki_(content|general), igwiki_general
[17:42:35] (for wikis <= srwiki, since various clusters are working on the rest)
[18:07:11] meh, elasticsearch has a ticket for "Replace _scroll with the ability to acquire point-in-time views + search_after" which was merged in 7.10.0. In x-pack :S
[18:08:26] sigh... I wonder if we can mimic that with a search_after based on page_id
[18:09:45] won't be a point in time like _scroll but maybe close enough
[18:09:52] yea will see what we can do, probably something
[18:12:20] curiously while the patch mentions x-pack the docs don't seem to. But the queries don't work on our docker image, so that implies it does require x-pack: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/point-in-time-api.html
[18:24:34] back
[18:29:18] ryankemper: I'll be 2' late
[18:30:00] gehel: 5' late myself actually
[18:30:05] see ya soon
[18:53:23] search_after directly on page_id seems reasonable. A quick test on enwiki_content (match_all w/ no source) takes < 10 min to iterate all docs, reasonably consistent 600-1000ms per 10k docs.
[18:56:16] might get a little slower as it gets later into the iteration, would have to graph the values and repeat a few runs to be sure
[18:59:12] Dr. appt, back in ~1h
[18:59:59] 3
[19:50:53] back
[19:53:24] yea search_after looks like it will work fine. It's slightly wonky because it's going to sort the page_ids as strings... but not important
[19:56:34] * ebernhardson wonders if he should change the other scroll use cases too...
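[Editor's note: the search_after idea above — keyset pagination keyed on page_id instead of holding a _scroll cursor — can be illustrated without a live cluster. This is a plain-Python sketch with invented names, not the elasticsearch client API; each "request" asks only for documents whose sort key is strictly greater than the last key seen, which is why, unlike _scroll, it is not a point-in-time view: writes landing mid-iteration can be seen or missed.]

```python
# Keyset ("search_after") pagination sketch over an already-sorted doc set.
# In the real query this would be a sorted search with a search_after clause;
# here a sorted list stands in for the index.

def search_after_iter(docs, page_size=2, key=lambda d: d["page_id"]):
    ordered = sorted(docs, key=key)
    last = None
    while True:
        if last is None:
            page = ordered[:page_size]
        else:
            # Equivalent of passing the previous page's last sort value
            # as search_after: strictly greater keys only.
            page = [d for d in ordered if key(d) > last][:page_size]
        if not page:
            return
        yield page
        last = key(page[-1])


docs = [{"page_id": str(i)} for i in (1, 2, 9, 10, 11)]
for page in search_after_iter(docs):
    print([d["page_id"] for d in page])

# The caveat from the discussion: page_ids stored as strings sort
# lexicographically, so "10" sorts before "9".
print(sorted(["9", "10", "11"]))  # ['10', '11', '9']
```

The string-sort order is harmless for a full iteration — every document is still visited exactly once — it just makes the traversal order look odd, which matches the "slightly wonky... but not important" conclusion.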
[20:31:37] * ebernhardson notes that writing the tests for SearchAfter is much more involved than the class itself :P
[20:54:19] ryankemper or anyone else, I have the patch up for final decom of elastic10[48-52] if you have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/835269
[21:32:17] Mac wants to update, it says it'll be about ~30m
[22:03:14] back, but headed out for the day
[22:03:15] (and man, that was 30 minutes almost to the second!)
[22:55:50] they're clearly not using the windows updater algo :P