[07:59:52] I'm trying to write the weekly status update, and there is almost nothing in https://etherpad.wikimedia.org/p/search-standup or in the "needs reporting" column in https://phabricator.wikimedia.org/tag/discovery-search-sprint/. Is this because nothing got completed this week?
[08:00:09] pfischer: is T358599 complete? Or do we have more to do?
[08:00:10] T358599: Integrate Saneitizer with SUP - https://phabricator.wikimedia.org/T358599
[08:02:33] these "connect error: 113" started on Apr 24 around 9pm
[08:03:17] enwiki_titlesuggest is still missing some pages :(
[08:05:27] possibly related patch merged at that time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937
[08:19:19] the timing matches perfectly but these new hosts seem perfectly accessible from mwmaint...
[08:31:51] gehel: I’ll update the etherpad
[08:32:09] pfischer: thanks!
[08:37:40] gehel: done
[08:37:47] thanks!
[08:45:19] gehel: T358599 is complete, that was waiting for a cirrus patch to be deployed but that's done now.
[08:45:20] T358599: Integrate Saneitizer with SUP - https://phabricator.wikimedia.org/T358599
[08:49:03] Ok, I'm closing it
[08:49:50] update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-04-26
[08:56:51] tempted to re-instantiate https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1024478 (switch completion to codfw) while we figure out what the problem is
[08:57:54] fixing the maint script to be more resilient might be too complicated for a Friday deploy...
[09:01:06] dcausse: do you need help for any of that?
[09:01:36] gehel: I think I need approval from an SRE, will ask
[09:05:27] scream if you need me!
[09:05:49] ryankemper: T346455 has been in progress since forever. Any chance to close it this week?
[09:05:50] T346455: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455
[09:09:53] dcausse: if you don't get a reply in the security channel, you should probably just deploy and log in -operations
[10:08:32] lunch
[12:42:16] i'll send that Google Form by probably 2 hours from now. kid routine, just popping back in to note it while it's in my mind
[13:13:51] hm... checking if the number of docs in the completion index is sane is not straightforward... it has one additional doc if the original doc has at least one redirect that is not similar to the title...
[13:18:44] dcausse do you want me to revert that puppet patch? and/or I can ban those nodes
[13:20:34] inflatador: should not be necessary, the error rate is pretty low; it's just that our maint script is too fragile
[13:20:47] now that we serve from codfw we should be good
[13:21:05] was it related to that patch?
[13:21:35] just wondering if that's something we need to check before we bring new hosts online
[13:22:14] inflatador: no clue if this is related but a new error appeared right after this patch was deployed
[13:23:30] I think we need to understand why this happens, I'm not clear what this "delayed connect error: 113" means
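For reference, the 113 in "delayed connect error: 113" is almost certainly the Linux errno EHOSTUNREACH ("No route to host"), the same failure the overnight probes report further down in this log. A quick Python check of that mapping:

```python
import errno
import os

# On Linux, errno 113 is EHOSTUNREACH, i.e. "No route to host",
# matching the failures reported from mwmaint later in this log.
print(errno.EHOSTUNREACH)               # 113 on Linux
print(os.strerror(errno.EHOSTUNREACH))  # "No route to host"
```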
[13:24:20] yeah, that's weird. How did you spot the index problem?
[13:25:31] Erik investigated https://phabricator.wikimedia.org/T363516 when it was reported by a user on wiki
[13:28:16] I guess we need some monitoring for huge index size changes or something
[13:29:21] yes, we definitely need to improve that script
[13:31:25] inflatador: see https://phabricator.wikimedia.org/T363521, Alex is having a look as well
[13:42:20] and checking a variation before and after is not easy either, on some wikis we re-use the same index
[13:42:38] perhaps first we need to make sure that the scroll is getting to the end
[13:43:50] iirc there were a few other places where we gave up on this "php Iterable" scroll
[13:44:51] might have been in the old in-place reindex script, at the time we did not yet rely on the reindex elastic module
[13:52:32] I'll individually ping on Slack, but just a heads up that the Google Form for the second round of consideration has been sent.
[14:00:35] We used https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Elastica/+/399376 in the past
[14:01:22] but my old self said: "it's misleading and does not work as expected. Better to use Elastica Scroll API.", I guess I was wrong...
[14:06:20] oh perhaps the solution is using search_after like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/835277
[14:29:35] uploaded https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1024698 (not well tested) but the transition seems straightforward
[14:29:49] might need to add some more safeguards still
[14:49:24] \o
[14:51:50] dcausse: i was thinking of two bits to help the maint script. We can replace Scroll with SearchAfter, this doesn't have the bad behaviour of exiting iteration on a failed request. Alternatively the scroll should return all expected documents by construction, so we could compare total hits to the # of docs processed
[14:52:18] since scroll basically holds a lock on the segments at the time it was created, iiuc
[14:52:39] o/
[14:52:48] ebernhardson: makes sense
[14:53:21] i'm not sure which is better though :P The problem with search after is we lose that easy guarantee of comparing total hits
[14:53:56] although, i was also a little unsure if it's really a guarantee. It should be, but i would probably simply log the values first instead of maybe blowing up the builds
[14:54:32] * ebernhardson should look at your patch :P
[14:54:45] my patch adds nothing new
[14:55:19] it just uses search_after which I think will properly fail if it can't continue
[14:55:31] dcausse: that looks reasonable, i was thinking about the same. I guess i had sorted by page_id, but _id probably has similar characteristics
[14:55:47] yea i glanced over the search after code yesterday, it looks like it should properly fail
[14:56:06] was not sure about page_id, whether it was properly reindexed or not
[14:56:31] but if it is, yes, we should definitely prefer this one
[14:56:39] in theory it is since we just ran a reindex against everything, all indices start with 17 which is pretty recent
[14:57:02] the second scroll on the other hand is on the suggest index itself which does not have page_id for sure
[14:57:39] oh, right. _id is probably fine, i suppose i expect them to behave pretty similarly but with different exact values
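The patches above are PHP/Elastica; purely as an illustration of the search_after mechanics being discussed, here is a Python sketch against the plain Elasticsearch REST API. The endpoint and index name are invented for the example, and it assumes a numeric page_id field as discussed above (_id would behave similarly); the point is that a stable sort plus search_after lets a failed request raise instead of silently ending iteration the way a broken scroll can.

```python
import requests

ES = "https://search.svc.eqiad.wmnet:9243"  # illustrative endpoint
INDEX = "enwiki_content"                    # illustrative index name

def iter_docs(session: requests.Session, page_size: int = 1000):
    """Walk an index with search_after instead of a scroll cursor."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # search_after needs a deterministic sort; page_id is assumed here,
        # _id could serve as a tiebreaker or replacement.
        "sort": [{"page_id": "asc"}],
    }
    while True:
        resp = session.post(f"{ES}/{INDEX}/_search", json=body, timeout=30)
        resp.raise_for_status()        # connect errors / 5xx surface here
        hits = resp.json()["hits"]["hits"]
        if not hits:
            return                     # reached the end of the index
        yield from hits
        # resume from the sort values of the last hit of this page
        body["search_after"] = hits[-1]["sort"]

# usage sketch: count what was actually iterated and compare it against the
# index's own total, e.g. sum(1 for _ in iter_docs(requests.Session())).
```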
[14:58:20] page_id should be quicker because of docvalues but it might not be very noticeable
[14:59:38] we should probably also create a ticket about the network issues...i intended to write a patch yesterday but then got distracted trying to narrow down where exactly the network issues are. Then ran out of time :P
[15:00:40] had it run 600 requests over 10 hours overnight, looks like 16 failed with no route to host.
[15:00:52] (just sleep awhile and make a request)
[15:01:15] it's apparently due to new hosts being added to lvs, Brian is looking into it
[15:01:36] alex removed them from pybal and the errors stopped
[15:01:43] ahh, ok nice
[15:01:57] I'm having trouble reproducing the errors from a "bad" host
[15:02:39] just doing `curl -s --resolve search.discovery.wmnet:9243:127.0.0.1 https://search.discovery.wmnet:9243` from 1103
[15:04:11] was thinking of trying some queries similar to the ones that failed...if anyone has other ideas LMK
[15:04:17] in a quick test i'm not getting any fails from mwmaint -> search.svc.eqiad.wmnet, but that's expected with them depooled.
[15:04:22] inflatador: what's the list of bad hosts, just 1103?
[15:04:33] 1103-07
[15:05:25] none of them failed out of LB rotation and they all look healthy AFAICT
[15:05:59] my best guess is that timeouts (or whatever is happening) only manifest when you're doing a query
[15:07:08] nothing in nginx error logs, but it was probably a connectivity issue so not a huge surprise there
[15:07:12] i wasn't seeing timeouts, i was seeing no route to host
[15:07:52] manually testing queries from mwmaint to the servers, but not seeing the same errors as yesterday yet. Only checked one though, will step through them
[15:07:57] interesting. that was just from a mwmaint server?
[15:08:14] mwmaint1002, yea
[15:08:29] basically just a python while True: loop that would sleep then request the url
[15:08:45] but it was queries as opposed to just the base URL?
[15:08:56] no this was querying the banner, at /
[15:09:15] i mean i was suspicious because i saw cirrus getting these with queries, but i tested against the banner page
[15:09:36] if that's sufficient to trigger it, that really sounds like a nw issue
[15:10:19] spammed about 1k requests at each of 1103-1107 from mwmaint1002, not seeing any failures right now :S
[15:10:59] perhaps we can repool them one at a time to try to isolate the culprit if only one is misbehaving? it's pretty obvious in logstash when things start to break
[15:11:32] seems reasonable, and not too hard
[15:11:47] y, we can try it. Although I'm still trying to reason out exactly what happened
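The overnight probe described above was not pasted in the channel; a minimal sketch of what such a loop might look like, assuming the search.svc.eqiad.wmnet banner page and one request per minute:

```python
import time
import requests

URL = "https://search.svc.eqiad.wmnet:9243/"  # banner page; endpoint assumed
ATTEMPTS = 600                                # ~10 hours at one request/minute

failures = 0
for i in range(ATTEMPTS):
    try:
        requests.get(URL, timeout=10).raise_for_status()
    except requests.RequestException as exc:  # includes "No route to host"
        failures += 1
        print(f"{time.strftime('%H:%M:%S')} attempt {i}: {exc}")
    time.sleep(60)

print(f"{failures}/{ATTEMPTS} requests failed")
```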
[15:12:30] seems like a nw issue to me, something flaky with the routes
[15:13:10] similarly, it seems like something we need a decent reproduction for and then pass to someone who knows how that works
[15:13:28] agreed...give me a sec and I'll repool 1103
[15:14:54] dashboard to watch: https://logstash.wikimedia.org/goto/4422d82bb504e1b4fb22513b2a4fd2e5
[15:15:05] OK, 1103 is pooled and at higher weight than other hosts
[15:16:50] here too: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad
[15:16:56] upstream error rate from envoy
[15:19:02] if I'm not mistaken https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023953 was merged after pooling these hosts; could this have caused an issue?
[15:22:26] not seeing any errors yet
[15:22:50] if you get rid of "connect error: 113" I think there are some
[15:23:16] inflatador: these are expected I think
[15:23:48] `Search backend error during more_like search for 'morelike:Reza_Aslan' after 254: unknown: Status code 503; upstream connect error or disconnect/reset before headers`
[15:24:47] inflatador: not seeing anything suspicious yet, these 503 happen sometimes
[15:24:52] got it
[15:25:53] 1/ 21:30 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937 gets merged; 2/ 21:35 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[3-7]\.eqiad\.wmnet => errors start; 3/ 23:30 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023953 gets merged
[15:26:04] should step 3 have gone before step 2?
[15:27:34] I think we can pool another one? pretty sure the errors would have started already
[15:28:02] i agree
[15:28:28] that could've caused a transient problem, you'd think it would go away though as the hosts were already part of the cluster
[15:29:41] no clue... but if there was a stale state somewhere perhaps?
[15:29:51] pooling the servers beforehand was definitely a mistake, but without elastic installed they would probably fail health checks. And once part of the cluster they should have been able to respond
[15:30:03] we definitely saw them as part of the cluster during pairing yesterday
[15:30:13] anyway, I will try pooling another
[15:33:02] 1104 is pooled
[15:33:10] same as before, higher than "good" hosts
[15:33:34] heading out, have a nice weekend!
[15:34:00] enjoy!
[15:34:10] inflatador: still looking ok, although we can wait a minute
[15:38:08] still no errors, keep going?
[15:43:10] ebernhardson ACK, just repooled 1105
[15:45:18] inflatador: that got a bump
[15:45:24] not a big one yet though
[15:45:39] oh yea, that's doing it
[15:45:50] 26 out of 600 req's failed with no route to host
[15:46:06] probably higher now than yesterday due to higher weighting
[15:46:11] interesting
[15:46:21] let's look at that guy in netbox
[15:55:00] ebernhardson topranks started up https://meet.google.com/nnh-wmra-gzr when/if you'd like to join
[15:56:34] depooled 1103 and 1104
[16:02:31] https://etherpad.wikimedia.org/p/search-503s has it
[16:41:47] and to explain: that's why all our manual curls and pings etc. worked
[16:42:16] it's only genuine traffic hitting the LB, where it tries to just rewrite the MAC and send it direct at L2, that would fail
[16:42:41] when we ping, traffic follows the default route, goes out and back via the switches, and is fine
[16:43:20] makes sense, appreciate looking into it :)
[16:49:24] np - sorry, basic error on our part
[16:49:34] fwiw if you depool 1105 and 1107 for now that would be good
[16:49:41] I think you should not see errors in that case
[16:50:00] I'm working on the patch to add the missing vlan sub-ints on our load-balancers for the vlans those hosts are on
[16:59:45] topranks ACK, both are depooled
[17:01:14] topranks we closed down the Meet, I'm going to work on incident report/monitoring. Thanks again for your help and LMK if you need anything
[17:02:36] ok cool
[17:03:10] you might give me a +1 for this. I appreciate it may not mean much but nobody in traffic has got back to me yet and we need to add it
[17:03:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024776
[17:05:11] topranks ACK, +1'd
[17:05:17] thanks
[17:05:19] will merge now
[17:06:45] going AFK for ~20m
[17:20:15] ok I think we should be ok now
[17:20:40] cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
[17:20:40] 10.64.152.0/24 dev vlan1047 proto kernel scope link src 10.64.152.19
[17:20:48] whereas previously it showed:
[17:21:03] cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
[17:21:03] default via 10.64.32.1 dev eno1np0 onlink
[17:21:34] key difference is it's now directly on the vlan and can arp for the destination, allowing the MAC rewrite
[17:21:37] cmooney@lvs1019:~$ ip neigh show 10.64.152.2
[17:21:37] 10.64.152.2 dev vlan1047 lladdr 14:23:f2:c2:96:e0 REACHABLE
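A small follow-on sketch in the spirit of the checks above, in case it is useful next time hosts land on a new vlan: ask the load balancer's kernel which route would carry traffic to each realserver. Direct routing only works when the route is directly connected ("scope link"), because the balancer must ARP for the backend in order to rewrite the destination MAC; a match on the default route means the vlan sub-interface is missing. The realserver addresses here are just examples.

```python
import subprocess

# example realserver addresses; substitute the actual backends
REALSERVERS = ["10.64.152.2", "10.64.152.3"]

for ip in REALSERVERS:
    route = subprocess.run(
        ["ip", "route", "get", "fibmatch", ip],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # "scope link" => directly connected: the LB can ARP and rewrite the MAC.
    # Anything else (e.g. the default route) breaks the MAC-rewrite path.
    ok = "scope link" in route
    print(f"{ip}: {'direct' if ok else 'NOT directly connected'}\n    {route}")
```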
[17:25:04] back
[17:25:32] Ah, looks good then!
[17:25:40] thanks again for your help
[17:28:31] ^^ pooled elastic1103-07 in eqiad, watching for errors. Per conversation with ebernhardson, we're going to wait until Monday to repool at the DC level
[17:28:34] no probs - do you want to try and repool one of those servers with a high weight and let's see?
[17:28:39] ok cool
[17:28:43] sure
[17:29:28] I'll run the mwmaint script again a few times too
[17:29:51] cool
[17:40:48] incident report up at https://wikitech.wikimedia.org/wiki/Incidents/2024-04-26_Missing_Search_suggestions_in_eqiad , still filling it out
[17:48:27] so no new errors in logstash, I think the new hosts are working now
[17:52:23] ok nice
[17:52:31] yeah i can see on the lvs that it's sending traffic to them no problem
[17:52:43] I replied on the task and added a little text to the incident report
[17:53:04] may need to flesh out the incident report a bit more, I have to step out shortly though so will need to revisit next week
[17:57:56] Yeah np
[18:00:07] I'm still not fully understanding this...the completion index was already broken ( T363516 ) before we brought the new hosts into prod ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937 )
[18:00:07] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516
[18:00:24] which ui shows alerts like https://phabricator.wikimedia.org/T359215 ?
[18:00:59] it looks like a mailing list history search maybe?
[18:01:25] I believe tchin was working on something like this as part of the alerts review?
[18:01:41] https://dpe-alerts-dashboard.toolforge.org/ ?
[18:05:23] hmm, trying but not finding the same info
[18:06:23] has similar ui colors to the hyperkitty (lists.wikimedia.org) archive search, but not finding that ui
[18:06:27] maybe it ages out or something?
[18:06:42] i mean finding something that has a search with the results shown as a histogram
[18:07:04] i suppose i am trying to reproduce that query, but i was using the histogram search as a proxy for finding the right tool :)
[18:07:05] oh yeah, not sure then
[18:07:18] anyway, hitting up lunch, back in ~40
[18:47:23] back
[18:48:54] this incident report's going to be a bit confusing since the phab task we've been working out of is unrelated AFAICT
[18:49:08] I moved the incident report to https://wikitech.wikimedia.org/w/index.php?title=Incidents/2024-04-26_Search_unavailable_for_some_eqiad_users to make it less confusing
[18:53:14] maybe the indexing failure is a knock-on effect from the bad hosts in the pool?
[18:53:44] inflatador: it's related, 1% of requests failing triggered a problem in the completion builder: a rebuild failed early but was still promoted
[18:54:18] ebernhardson ACK, thanks for clarifying
[18:55:20] suggests our code is broken in more places than we know, maybe we need 1% of requests to fail more often :P
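The "maybe we need 1% of requests to fail more often" remark is essentially fault injection; a throwaway sketch of the idea, with invented names, wrapping whatever callable sends a request so that roughly 1% of calls raise and the error paths this incident exposed get exercised:

```python
import random

class FlakySender:
    """Wrap a request-sending callable and fail ~failure_rate of the calls."""

    def __init__(self, send, failure_rate=0.01, seed=None):
        self._send = send
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            # stand-in for a transport failure such as errno 113
            raise ConnectionError("injected connect failure")
        return self._send(*args, **kwargs)

# usage sketch: session.post = FlakySender(session.post, failure_rate=0.01)
```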
[18:59:27] Yeah, it's a good learning experience ;P
[19:16:42] I’m glad I was able to help out :P
[19:19:18] ever forward!
[19:19:28] Going AFK, back in ~40
[20:26:46] back
[20:59:25] * ebernhardson thinks he resolved the issue with counting external search engine pageviews and is re-calculating the t358345 table
[20:59:48] it was actually stupidly simple, i just had to directly count them like everything else instead of being too-smart and retaining values to resolve later
[21:18:27] Brief followup on this tentative timeline from David as far as when the new elastic* hosts were pooled relative to patches being merged: `1/ 21:30 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937 gets merged; 2/ 21:35 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[3-7]\.eqiad\.wmnet => errors start; 3/ 23:30 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023953 gets merged`
[21:18:51] I double-checked and the above times for the gerrit patches are wrong, here's the corrected timeline:
[21:18:56] 19:27 first patch merged, 21:22 second patch merged, 21:35 hosts pooled
[21:24:25] https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-2d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad&viewPanel=16
[21:26:23] https://phabricator.wikimedia.org/P61225
[21:35:24] Looks like only autocomplete was failed over to codfw: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1024478/2/wmf-config/ext-CirrusSearch.php
[21:36:27] Probably more impact than we thought, then
[21:43:11] right, we only moved autocomplete since the index builds failed and we were returning bad results
[21:44:37] but we still had the 1% error rate from Envoy
[21:44:50] right, because there was still both update traffic and user traffic
[21:44:56] related articles, fulltext, etc.
[21:45:06] so as far as user impact, I guess that was timeouts
[21:45:25] looks like it should have been quick failures, with a message that says please try your search again
[21:45:37] or autocompletes and related articles probably didn't display