[08:31:17] hello folks!
[08:31:27] I am checking https://logstash.wikimedia.org/app/dashboards#/view/memcached
[08:31:39] there seems to be an issue with a mwmaint script afaics
[08:32:24] but I am a little ignorant about labswiki and its memcached settings
[08:32:27] anybody has context?
[08:35:30] elukey: there is a conversation about it on operations
[08:35:44] group2 train has been rolled out and seems to have an issue
[08:35:58] nono it is a different issue
[08:36:01] hashar: o/
[08:36:44] the draft namespace issue is unrelated afaics, the memcached errors seem to be related to labswiki/wikitech (triggered by a mwmaint script)
[08:36:53] probably nothing but I wanted to clear them out
[09:08:51] an example is https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2024.07.04?id=ZVkAfZABbv8L2HoqSjNQ
[09:09:14] all coming from mwmaint1002, but I don't see errors in mwmaint's mcrouter metrics
[09:09:27] so it must be something else but can't find the culprit
[09:31:07] effie: o/ any idea how to debug this --^ ?
[09:36:12] elukey: I think I am missing a lot of context so you will have to give me some time
[09:38:24] sure sure
[09:39:53] the main issue is that I am not 100% sure where the memcached failure to contact host happens
[10:01:15] <_joe_> elukey: wikitech doesn't use mcrouter
[10:01:40] okok makes sense then
[10:01:59] <_joe_> so if it's trying to use mcrouter, there's a bug in mediawiki-config
[10:02:57] nono I just noticed "Memcached error: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY" coming from mwmaint1002
[10:03:09] so I was wondering if it was due to the local mcrouter being in a bad shape
[10:03:27] the script that is running is /srv/mediawiki-staging/multiversion/MWScript.php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki labswiki
[10:03:32] and triggering the failures
[10:04:06] but nowadays I am totally ignorant where to check :)
[10:04:16] dcausse might know more...
[10:04:44] <_joe_> gehel: I doubt it has to do with the code
[10:05:01] <_joe_> elukey: any idea when these started appearing?
[10:05:28] yeah, 7:15 UTC - https://logstash.wikimedia.org/app/dashboards#/view/memcached?_g=h@41d3bb7&_a=h@e82a689
[10:05:42] It must be a systemd timer kicking off the script
[10:05:59] or something failed at 7:15 :D
[10:07:02] <_joe_> no I mean
[10:07:13] <_joe_> can you find them 10 days ago?
[10:08:22] some yes https://logstash.wikimedia.org/goto/2e0513c0e7c2ed8dc5785b18db821862
[10:08:25] but not at this rate
[10:08:26] <_joe_> because I'm looking at the code and I don't see how this can work on labswiki
[10:08:53] <_joe_> for labswiki memcached is called on 127.0.0.1:11212
[10:09:05] <_joe_> which is *nutcracker* in cloudweb
[10:09:18] no ok wait, if I isolate for mwmaint I don't see them
[10:09:37] where are you looking?
[10:12:46] <_joe_> sorry, I must have misunderstood
[10:13:00] <_joe_> elukey> there seems to be an issue with a mwmaint script afaics
[10:13:07] <_joe_> I assumed it meant it was running on mwmaint
[10:14:10] <_joe_> and yes, all the messages are from mwmaint1002
[10:14:36] <_joe_> so the problem is we're running a maint script for wikitech from mwmaint1002
[10:15:22] <_joe_> dcausse: maybe it's you indeed after all
[10:15:45] <_joe_> :D
[10:35:00] <_joe_> elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052075 might be a good idea
[10:55:43] definitely, will review it after lunch, thanks!
[12:09:30] Codesearch is down at the moment, with some search indices not starting: https://codesearch.wmcloud.org/_health/ - Should I look into it myself, or is someone else better placed to do so? Thanks.
[12:16:54] btullis: Amir1 is probably the best at this time
[12:16:59] How long has it been down?
[12:17:33] There's some docs at https://www.mediawiki.org/wiki/Codesearch/Admin
[12:18:05] btullis: for now don't use "search" backend ("Everything")
[12:18:46] I will take a look
[12:20:54] Thanks. Sorry, I didn't mean to put pressure on. I'm happy to investigate, but I've never done so yet.
[12:46:42] _joe_: yes it's me, sorry did not know that was not supported
[12:46:58] <_joe_> dcausse: no problem it's 100% understandable
[12:47:05] <_joe_> that's why I created the above patch
[12:47:09] <_joe_> footguns--
[12:47:11] thx!
[12:52:48] !bash <_joe_> footguns--
[12:52:48] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/WajNfZABFk7ipym_9gaf
[13:02:24] we have a system timer mediawiki_job_cirrus_build_completion_indices_eqiad.timer on mwmaint1002 which is running mwscript over /usr/local/bin/expanddblist all
[13:02:34] https://gerrit.wikimedia.org/g/operations/puppet/+/dd748a51e25a020b42c467009fa6a6dd1b97eba0/modules/profile/files/mediawiki/maintenance/cirrus_build_completion_indices.sh
[13:03:16] so it might be hitting labswiki
[13:03:48] <_joe_> ^_^
[13:04:41] this one might not rely too much on memcache tho (if this is the main issue)
[13:05:53] happy to exclude labswiki there but wondering from where maint scripts should be run for labswiki
[13:10:35] <_joe_> it's not the only one
[13:10:51] <_joe_> dcausse: I guess from cloudweb? 301 wmcs though :)
[13:14:56] <_joe_> dcausse: so I think the problem is a bit wider than I anticipated, there's plenty of scripts running for labswiki from the mwmaint server, o dear
[13:17:27] _joe_: ack, please lemme know if I can be of any help
[14:09:28] I restarted the systemd service for everything
[14:13:58] <_joe_> Amir1: ?
[14:14:06] codesearch
[14:14:13] it has a service for "everything"
[14:14:22] I should have been clearer, sorry
[14:14:45] just a little cumin '*' 'systemctl restart *'
[14:14:48] easy
[14:15:35] no no, that would be too chaotic, you always first start with databases cumin 'db*' 'systemctl restart *'
[14:16:09] true true
[14:21:10] thanks for the restart
[14:22:46] so what was the issue in the end, something related to labswiki or...?
[14:22:50] out of curiosity