[00:14:00] Set some timeouts in AnomieBOT's code; now it's logging lots of "Error while reading from Redis server: Resource temporarily unavailable", but at least it's not hanging the jobs for so long anymore.
[00:35:32] Hm, running this Python 2 script directly on the bastion host is failing, but it's working seemingly fine through an exec host.
[00:37:13] I guess I'll route one-off manual runs via jsub or switch to Python 3.
[00:46:43] you should do both of those things
[01:55:50] Probably, but I've been using the same code since like 2008 and I'm lazy.
[01:59:49] AntiComposite: Do you use Python much? Trying to figure out which is the best Python client to use among https://www.mediawiki.org/wiki/API:Client_code#Python
[02:00:58] I mostly use pywikibot; mwclient is also a decent option if you want less of a framework and more of a library
[02:49:41] All right, thanks.
[07:33:20] taavi: if you're around, can you restart wikibugs please? looks like you're a maintainer
[08:54:29] Like anomie, I get a lot of redis exceptions: "read error on connection". I am using PHP.
[08:58:48] Wurgl: what does your code look like? my tests work fine
[09:02:51] $ret = $this->redis->set($key, "1", ['nx', 'ex' => $timeout]); <-- does not help much, I think
[09:03:18] It does not happen every time, just roughly every ~100th (guessing) call
[09:04:15] taavi: I am using redis to keep concurrent scripts from writing too often (max 10 edits per minute, as deWP allows)
[09:08:23] It smells(!) like the behaviour has changed. Before, it returned null; now I get an exception.
[09:14:07] Wurgl: we did indeed upgrade the redis server and the OS it runs on, although I don't remember anything related to timeouts in the redis changelogs or in what we changed
[09:14:13] which hostname are you using to connect to redis?
[09:14:29] tools-redis
[09:14:37] port 6379
[09:15:07] can you tell if it happens if you connect directly to tools-redis-5.tools.eqiad1.wikimedia.cloud, bypassing the load balancing layer?
[09:16:20] Well, I can try with another tool … takes some minutes
[09:27:59] 2022-05-23 09:27:14 Exception: read error on connection
[09:28:12] Same, an error every second
[09:28:26] 2022-05-23 09:27:15 Exception: read error on connection
[09:29:11] … 16, 17, 18, 19, then there was some data and it got processed
[09:32:43] In this tool the code is $key = $redis->blpop($queues, 3600); where $queues is an array of three strings
[10:16:41] looking at the redis logs it seems it's restarting twice every hour or so
[10:18:32] puppet does that: May 23 10:11:21 tools-redis-5 puppet-agent[1923824]: (/Stage[main]/Profile::Toolforge::Redis_sentinel/Redis::Instance[6379]/File[/etc/redis/tcp_6379.conf]) Scheduling refresh of Service[redis-instance-tcp_6379]
[10:27:15] hmm, that could indeed explain some of the weirdness
[10:27:30] the config is being rewritten over and over too
[10:27:57] the overwritten one has "# Generated by CONFIG REWRITE" in it
[10:28:31] and some of the options seem quite different, for example (+ is puppet, - is whatever was there):
[10:28:33] -client-output-buffer-limit normal 0 0 0
[10:28:35] +client-output-buffer-limit slave 512mb 200mb 60
[10:34:16] I think it might be sentinel self-managing things?
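(An aside on the CONFIG REWRITE diff above: one rough way to confirm the fight is to compare the running config against the puppet-rendered file. A minimal redis-py sketch, assuming only the host and file path from the discussion; everything else is hypothetical:)

```python
import redis

# Sketch: diff the live redis config against the on-disk (puppet-rendered)
# file to spot directives that Sentinel's CONFIG REWRITE has changed.
# Host and file path are taken from the discussion above; this is an
# illustration only, and directives that repeat with different arguments
# (e.g. client-output-buffer-limit) will collide in the dict.
CONF_PATH = "/etc/redis/tcp_6379.conf"

r = redis.Redis(
    host="tools-redis-5.tools.eqiad1.wikimedia.cloud",
    port=6379,
    decode_responses=True,
)

# Parse the puppet-rendered file into directive -> value.
on_disk = {}
with open(CONF_PATH) as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(" ")
            on_disk[key.lower()] = value

# CONFIG GET * returns the live settings; redis-py parses them into a dict.
running = r.config_get("*")

for key, disk_value in sorted(on_disk.items()):
    live = running.get(key)
    if live is not None and live != disk_value:
        print(f"{key}: file={disk_value!r} running={live!r}")
```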
[10:34:28] yeah, probably
[10:34:34] I'll disable puppet and have a look later
[10:34:40] 👍
[10:34:58] if you create a task, subscribe me to it; I might have some time later too
[10:36:35] "Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart)." yep, looks related
[10:38:02] dcaro: T309014
[10:38:02] T309014: sentinel and puppet overwriting toolforge redis config - https://phabricator.wikimedia.org/T309014
[10:38:08] thanks!
[10:38:18] I'll go have some lunch now, be back later
[14:32:04] <_joe_> hi, wikibugs has been out of service since this morning at least
[14:32:28] _joe_: yes, that's tracked here: T308995
[14:32:28] T308995: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995
[14:33:26] I tried looking a bit into it earlier, but couldn't figure out how to get that IRC framework to log all communication with the server
[14:34:14] <_joe_> taavi: how did you even get the error logs?
[14:34:29] <_joe_> I tried to look at kubectl logs but there's nothing of substance there
[14:34:52] it logs them to `redis2irc.log` in the tool's home directory
[14:35:08] hmm, might this be related to the redis issues?
[14:35:18] <_joe_> dcaro: which redis issues?
[14:36:00] redis has been behaving strangely lately: some users reported connection issues, and it was restarting twice per hour (see T309014)
[14:36:00] T309014: sentinel and puppet overwriting toolforge redis config - https://phabricator.wikimedia.org/T309014
[14:42:23] dcaro: I doubt that, since wikibugs's redis2stdout.py script sees the messages, and the error log I found shows something weird with the IRC connection
[14:46:24] sounds right, let me know if I can help
[17:01:04] Hi, I'm the maintainer of refill on Toolforge. The tool is stuck, and in the past bstorm has done what was necessary to restart it, but I gather she's no longer employed at WMF. Is there someone else who can help?
[17:07:18] CurbSafeCharmer: hello! as the tool maintainer, you should be able to restart the tool yourself
[17:07:41] looks like that tool is a web service, so running `webservice restart` as the tool account should restart it
[17:08:06] will try that, thanks
[17:34:13] taavi: bstorm seems to have deleted the pods previously
[17:34:21] https://sal.toolforge.org/tools.refill-api
[17:34:40] CurbSafeCharmer: were you running that in -api or the main tool?
[17:39:38] @taavi I ran webservice restart, no joy. I think Brooke had to furtle with some Kubernetes stuff when this last happened. Let me see if I can find the last issue in Phabricator
[17:43:21] @taavi apparently she 'deleted pods' https://phabricator.wikimedia.org/T272483
[17:45:32] CurbSafeCharmer: that's exactly what 'webservice restart' does on the 'refill' tool :/
[17:46:30] how are those workers supposed to be running?
[17:48:27] Good question taavi - this was set up by someone who is no longer active, and I never had a proper handover
[17:48:38] @TheresNoTime are you around?
[17:49:15] CurbSafeCharmer: hey
[17:49:31] Hi there. Little help?!
[17:50:07] CurbSafeCharmer: I don't have access to the refill tool :/
[17:50:18] I can sort that, I think!
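(An aside on the morning's redis errors before the refill thread continues: the two PHP calls quoted at [09:02:51] and [09:32:43] translate roughly to the redis-py sketch below. Key and queue names are hypothetical; the retry loop is one naive way to ride out the "read error on connection" exceptions while the server restarts:)

```python
import time

import redis

# Hostname and port as in the discussion; key and queue names are made up.
r = redis.Redis(host="tools-redis", port=6379, socket_timeout=10)

def may_edit(key: str, timeout: int) -> bool:
    """Rough equivalent of $redis->set($key, "1", ['nx', 'ex' => $timeout]):
    True only if the key did not exist yet, i.e. the edit slot was free."""
    return bool(r.set(key, "1", nx=True, ex=timeout))

def next_job(queues):
    """Rough equivalent of $redis->blpop($queues, 3600), retrying when the
    connection drops instead of letting the exception kill the worker."""
    while True:
        try:
            item = r.blpop(queues, timeout=3600)
        except redis.ConnectionError:
            time.sleep(1)  # server restarting? back off, then reconnect
            continue
        if item is None:
            return None  # BLPOP timed out with no data
        queue, value = item
        return queue.decode(), value.decode()
```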
[17:50:52] https://toolsadmin.wikimedia.org/tools/id/refill looks like you should be able to :)
[17:51:21] * TheresNoTime has a shell name of `samtar` for legacy reasons, so you won't find `theresnotime` \o/
[17:54:18] done for refill and refill-api (but not for refill-dev, as I am not listed as a maintainer)
[17:54:38] CurbSafeCharmer: okay, I'm in :) lemme take a look
[17:55:29] You're a star
[17:59:50] @RhinosF1 sorry for ignoring you, just spotted your messages. I ran it on the wrong tool in the first instance but spotted my error and ran it again on refill-api
[18:17:19] CurbSafeCharmer: can you confirm it's working?
[18:20:33] @TheresNoTime yup, that's fixed it!
[18:21:06] what did you need to do?
[18:21:09] CurbSafeCharmer: https://phabricator.wikimedia.org/P28363 & https://phabricator.wikimedia.org/P28362, I'll write up what I did to resolve it (though mainly it was deleting the deployment and redeploying it)
[19:21:36] !log deployment-prep Deleted deployment-elastic0[5-7] in favor of newer bullseye hosts T299797
[19:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[19:21:40] T299797: Deploy new bullseye elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797
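(A closing aside on the refill fix at [18:21:09]: "deleting the deployment and redeploying it" amounts to something like the sketch below, using the official kubernetes Python client. The namespace and deployment name are assumptions; on Toolforge the supported route is `webservice` or `kubectl`, and this only illustrates the API calls involved:)

```python
from kubernetes import client, config

# Assumed names: Toolforge tool namespaces follow tool-<name>, and the
# deployment name here is a guess; both are for illustration only.
NAMESPACE = "tool-refill"
NAME = "refill"

config.load_kube_config()
apps = client.AppsV1Api()

# Save the current spec so it can be re-created afterwards.
deploy = apps.read_namespaced_deployment(NAME, NAMESPACE)

# Deleting the deployment removes its pods for good; deleting a pod alone
# (what `webservice restart` effectively did here) just lets the
# ReplicaSet spawn a replacement in the same possibly-broken state.
apps.delete_namespaced_deployment(NAME, NAMESPACE)

# Re-create from the saved spec, clearing server-assigned metadata first.
# (A real script would wait for the old deployment to finish terminating.)
deploy.metadata.resource_version = None
deploy.metadata.uid = None
deploy.status = None
apps.create_namespaced_deployment(NAMESPACE, deploy)
```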