[08:04:31] !log toolsbeta tools-manifest 0.24, T290325
[08:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[08:04:37] T290325: tools-manifest broken on toolsbeta - https://phabricator.wikimedia.org/T290325
[08:08:28] !log tools update tools-manifest to 0.24
[08:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:56:53] is something up with the toolforge proxy? :/ I'm getting 502s
[09:01:00] repeatedly?
[09:01:04] occasional 502s are a known issue
[09:01:55] everything seems to go a bit slow, and quite frequent 502s for me, but nothing seems wrong with my tool
[09:02:42] * majavah pokes at dashboards
[09:03:36] (the Phabricator task I meant would be https://phabricator.wikimedia.org/T282732)
[09:04:59] yeah, I see a few traffic spikes, and kubernetes ingress is known to not handle them very well
[09:05:30] gotcha, cool, as long as I don't need to investigate why my tool isnt works :D
[09:05:32] *working
[09:05:52] What ingress are we using?
[09:06:12] ingress-nginx
[09:07:38] URL?
[09:08:20] For mine you can see this quite a bit at https://backstage.toolforge.org/catalog I believe
[09:08:32] Who knows, maybe I am the cause of the traffic spikes xD
[09:08:43] but I doubt it, this tool is 2 days old
[09:18:45] there was a spike of 5xx errors not so long ago
[09:18:47] https://usercontent.irccloud-cdn.com/file/sWEZnKlR/image.png
[09:26:12] just got one on the versions tool too
[09:34:27] yup, still getting them :(
[10:05:37] will keep an eye on this, if the issue continues I'll investigate deeper
[15:23:06] addshore: +1, I also get more 502s than usual
[15:55:16] arturo: the logs from the ingress cluster roll over so quickly (because there's so much of them) that I usually miss the logs from the time when these events happen. If you happen to see one happening, doing a logs --tail 1000 on the ingress label would be good. The closest I ever came to a useful bit of data from that end was actual 500s from tools that really had errors.
[15:55:38] The problem *might* secretly be the front proxy blowing up lua processing silently, though.
[15:56:10] But yeah, if you see a spike like that, I'm eager to capture logs from the moment one happens :)
[15:56:10] we do haproxy in tcp and not http mode, right?
[15:56:22] Yes, I believe so there.
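
The "logs --tail 1000 on the ingress label" suggestion above could be wrapped in a small script that snapshots the ingress-nginx pod logs to a file before they roll over. This is only a sketch: the namespace and label selector below are assumptions, not the actual Toolforge values, and would need to be adjusted.

    #!/usr/bin/env python3
    """Snapshot recent ingress-nginx logs so they survive the fast rollover."""
    import subprocess
    from datetime import datetime, timezone

    NAMESPACE = "ingress-nginx"                         # assumed namespace
    SELECTOR = "app.kubernetes.io/name=ingress-nginx"   # assumed label selector

    def snapshot_ingress_logs(tail: int = 1000) -> str:
        """Grab the last `tail` lines from every matching pod and write them
        to a timestamped file for later inspection."""
        out = subprocess.run(
            ["kubectl", "logs", "-n", NAMESPACE, "-l", SELECTOR,
             f"--tail={tail}", "--prefix", "--timestamps"],
            capture_output=True, text=True, check=True,
        ).stdout
        fname = datetime.now(timezone.utc).strftime("ingress-logs-%Y%m%dT%H%M%SZ.log")
        with open(fname, "w") as f:
            f.write(out)
        return fname

    if __name__ == "__main__":
        print("wrote", snapshot_ingress_logs())

Run by hand (or from a cron job) the moment a 502 spike shows up on the dashboards; the point is simply to capture the window before the ingress logs rotate away.
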
[15:56:32] We use http mode for one endpoint in paws
[15:57:10] It's because tls is terminated at the front proxy, iirc
[15:57:28] In paws the termination is at haproxy
[16:02:57] At some point my theory was that we were running out of nginx workers on the ingress-nginx level
[16:08:17] will keep an eye
[16:08:59] if workers are the problem, perhaps a simple solution is to add yet another k8s ingress node to the rotation
[16:36:47] Definitely, scaling the ingress is pretty easy
[19:31:22] !log tools.wikibugs restarted libera-phab to pick up new "In progress" status
[19:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[19:33:13] !log tools.wikibugs restarted libera-irc to pick up new "In progress" status (didn't actually need to restart libera-phab)
[19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[19:35:51] and now it can't reconnect :/
[19:40:24] > 2021-09-15 19:39:24,962 - irc3.wikibugs - CRITICAL - connection lost (23385571927936): None
[19:42:13] ^ me running it manually
[19:45:05] the delay in logs coming across over NFS is really frustrating :/
[19:48:02] > Exception ignored in:
[19:58:17] I'm utterly stumped, it works fine on the exec node itself if I run it manually, but not when run under the grid
[20:01:12] !screen
[20:01:12] $ script /dev/null (https://wikitech.wikimedia.org/wiki/Screen#Troubleshooting)
[20:03:02] !log tools.wikibugs redis2irc is running in a screen because of T291129
[20:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[20:03:08] T291129: wikibugs failing to connect when run on exec hosts - https://phabricator.wikimedia.org/T291129
[20:03:21] apologies in advance for doing a v. bad thing
[20:05:47] legoktm: don't forget the wheel of fortune can kill some scripts
[20:07:22] sigh
[20:07:44] I'll ask for help from the cloud team after I eat lunch
[20:08:28] Today is a day of sighing
[20:10:43] legoktm: I think screen is ok actually https://github.com/wikimedia/puppet/blob/a5144914e0cc1f777f70c3482c91ce6a84240160/modules/toolforge/files/wmcs_wheel_of_misfortune.py#L38
[20:11:59] the Python process will still get killed
[20:12:21] Ah :(
[20:50:09] legoktm: did you change anything other than config? That traceback looks like deep asyncio problems on the surface, but I'm no asyncio nerd to validate that.
[20:51:07] nope
[20:51:20] specifically I changed https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/wikibugs2/+/acc3360b08b4f9807e0fb45067ed8e97bf351c36%5E%21/#F0
[20:51:38] and there were no undeployed changes afaics
[20:51:46] wacky
[20:53:19] unfortunately valhallasw is the asyncio expert, I barely know enough to get by
[20:54:24] ...and if my change was the breaking cause, then it should break when run manually too
[20:54:47] The "runs on a grid exec node, but not under grid control" part is the most confusing to me. There really shouldn't be much difference there.
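
One way to narrow down the "works manually on the exec node, fails under the grid" puzzle is a tiny connectivity probe run both ways and diffed. The sketch below is only illustrative: the first hostname is the one that ends up working later in the log, while the short name is a guess at what an older config might have used, since the actual old value isn't quoted here.

    #!/usr/bin/env python3
    """Probe DNS resolution and TCP reachability of candidate redis hosts.
    Run manually on an exec node and again as a grid job, then compare output."""
    import socket
    import sys

    HOSTS = [
        ("tools-redis.svc.eqiad.wmflabs", 6379),  # host that worked per the later SAL entry
        ("tools-redis", 6379),                    # hypothetical short name, for comparison
    ]

    for host, port in HOSTS:
        try:
            addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(host, port)})
            print(f"{host} resolves to {addrs}")
            with socket.create_connection((host, port), timeout=5):
                print(f"{host}:{port} TCP connect OK")
        except OSError as exc:
            print(f"{host}:{port} FAILED: {exc}", file=sys.stderr)

If resolution or connectivity differs between the two environments, that points at name resolution or network policy on the grid side rather than anything in the asyncio/irc3 code.
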
[21:48:36] omg
[21:50:28] !log tools.wikibugs switched config to use tools-redis.svc.eqiad.wmflabs as redis host and now it seems to work
[21:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[21:50:33] h/t to b.storm for suggesting ^
[23:43:41] !log tools.iabot truncated massive files Worker[1-5].out T288300 T288276
[23:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[23:43:47] T288300: IAbot is writing loads of text to Toolforge NFS at a high rate - https://phabricator.wikimedia.org/T288300
[23:43:48] T288276: 2021-08-05: Tools NFS share cleanup - https://phabricator.wikimedia.org/T288276
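
The iabot cleanup logged above is the kind of thing that can be scripted as a periodic job. This is only a rough sketch: the glob pattern and size threshold are guesses, not the actual iabot layout. Truncating in place keeps the same inode, so a writer that still has the file open in append mode can keep logging without a restart (a writer that is not in append mode would just leave a sparse hole up to its old offset).

    #!/usr/bin/env python3
    """Truncate oversized Worker*.out files in place instead of deleting them."""
    import glob
    import os

    LOG_GLOB = os.path.expanduser("~/Worker*.out")   # assumed location
    MAX_BYTES = 100 * 1024 * 1024                    # assumed 100 MB threshold

    for path in glob.glob(LOG_GLOB):
        size = os.path.getsize(path)
        if size > MAX_BYTES:
            os.truncate(path, 0)   # drop the contents, keep the file and its inode
            print(f"truncated {path} ({size} bytes freed)")
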