[02:05:41] It looks like anticompositebot hasn't been able to hit centralauth.web... since the 14th, and I can't connect to it from the bastion
[02:05:55] I can connect to centralauth.analytics... though
[02:05:56] https://phabricator.wikimedia.org/P16867
[06:51:03] something's up with FontCDN: most requests for font CSS files are returning as 502
[06:59:03] Hi chlod
[06:59:22] We were discussing this (https://en.wikipedia.org/wiki/Wikipedia_talk:RedWarn#FontCDN_issues)
[07:07:57] chlod, GooseTheCat: all the example URLs there are on the redwarn tool and not on the fontcdn mirror (tools-static.wmflabs.org/fontcdn/), do you have examples of broken mirror urls?
[07:08:38] no. any way to get them? I'm not a pro
[07:10:01] i have no clue how that tool works, you need to ask its maintainers (https://toolsadmin.wikimedia.org/tools/id/redwarn)
[07:10:39] majavah: here's an HAR of a failed request https://phabricator.wikimedia.org/P16868
[07:11:03] opening https://fontcdn.toolforge.org/ in general just shows a bunch of fallback serif fonts
[07:13:44] Going to the link shows *502 Bad Gateway*
[07:14:01] IDK if I'm helping or not =L
[07:15:14] "[error] 21643#21643: *13950138 no live upstreams while connecting to upstream," in the nginx logs, but I can connect to it via curl from the static server
[07:15:47] !log tools restart nginx on tools-static-14 to see if it helps with fontcdn issues
[07:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:16:32] that seems to have unbroken it?
[07:16:48] I'll try
[07:17:40] Fixed
[07:17:56] Thanks majavah
[07:20:52] weird, I'm not seeing anything in the logs that suggests why it happened :/
[07:22:58] ah well, I'm happy that it got fixed, I'm not sure if I can help any further
[09:15:39] AntiComposite: looks like a copy-paste fail in the dns name generation file, https://gerrit.wikimedia.org/r/c/operations/puppet/+/707274/ will fix it once reviewed + deployed
[13:31:40] !log toolsbeta upgrading toolsbeta to kubernetes 1.19, T280340
[13:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[13:31:46] T280340: Upgrade Toolforge Kubernetes to latest 1.19 - https://phabricator.wikimedia.org/T280340
[15:22:20] !log admin update wikireplicas-dns for s7 fix for web replicas
[15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:16:27] Cyberpower678: I cleaned up corruption on cyberbot-exec-iabot-01, removed all the files in MemoryFiles and rebooted. No promises about the state of the filesystem elsewhere though
[16:16:50] Ok. Thank you
[16:16:52] :-)
[16:18:55] The misconfigured bot running on nat.cloudgw.eqiad1.wikimediacloud.org repeatedly failing to authenticate as icinga-wm is still going, which risks getting the whole host banned if the SASL failure allow-rate changes
[16:23:42] bd808: is ^ something you know anything about?
[16:24:01] fyi glguy, that is a gateway host so it's hard for me to know where the traffic originates
[16:24:15] ^ (or godog?)
[16:24:42] I think a past theory was godog and some o11l test infra...
[16:25:09] * bd808 peeks in puppet to see if a related class is obvious
[16:26:42] pontoon-icinga-01.monitoring.eqiad1.wikimedia.cloud sounds like a good host to check, I'd imagine it's some falling back to production defaults
[16:27:04] that same address handles CBNGRelay, which seems to have a really hard time staying connected. Does that bot do anything?
[16:27:54] I don't see any successful connections with an alternative nickname correlating with the icinga-wm failures to help you with
[16:28:42] I think majavah guessed the correct node. I'm poking around there now a bit
[16:29:21] * andrewbogott chronically annoyed at the pontoon VMs even though they're put to a good purpose
[16:31:21] glguy: apparently CBNGRelay has been broken for a few months per https://phabricator.wikimedia.org/T274871, I imagine someone relying on it would have noticed so if it's causing issues we can get it disconnected
[16:31:59] glguy: have you ever made a phabricator task about the bad icinga-wm client? Just wondering if I need to make one or can just add some comments on an existing ticket
[16:32:33] bd808: no, I haven't made anything in your phabricator before
[16:33:17] no problem. I'll start a task and maybe put on a bandaid to stop the bot for now if it works the way I think it does
[16:33:36] do we know what nick it's trying to auth with?
[16:33:38] the auth and reconnect failures make this host show up on problem reports, the auth failures risk crossing the variable ban thresholds. I'm just trying to get ahead of a bans@ ticket about how your infra got blocked
[16:33:57] appreciate it :)
[16:33:58] icinga-wm is the nickname for the auth failures
[16:34:15] legoktm: It's icinga-wm from profile::icinga::ircbot
[16:34:37] hmm, but that runs in prod, why is it in the cloud too?
[16:34:48] https://openstack-browser.toolforge.org/server/pontoon-icinga-01.monitoring.eqiad1.wikimedia.cloud
[16:34:53] ohh
[16:38:10] Filed as T287265
[16:38:11] T287265: profile::icinga::ircbot trying to run icinga-wm without proper password - https://phabricator.wikimedia.org/T287265
[16:38:42] I'm fine with just shutting that VM off if that's an adequate solution; those pontoon nodes are 'rogue' and basically unmanageable by us.
[16:39:17] I'm going to try a hiera flag to disable the profile
[16:39:30] 'k
[16:39:39] I'll be shocked if puppet is running on that host
[16:40:15] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'filippo - filippo');
[16:40:17] !log tools.cluebotng stop cbng_relay grid job, still having issues with irc connection - T274871
[16:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[16:40:22] T274871: cluebotng: IRC freenode activity causes flood warnings, but there is a way to stop that - https://phabricator.wikimedia.org/T274871
[16:41:56] bd808: is that a second vote for 'shut down'?
[16:42:16] getting pretty close
[16:42:21] bd808: if puppet is already disabled, `systemctl mask ircecho` ?
[16:42:44] glguy: I stopped the CBNGRelay process, apparently it's been broken for quite some time now and not really used
[16:45:00] oh, thanks!
[16:45:38] !log monitoring `rm /usr/local/bin/ircecho` as the worst fix for T287265
[16:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Monitoring/SAL
[16:45:43] T287265: profile::icinga::ircbot trying to run icinga-wm without proper password from monitoring Cloud VPS project - https://phabricator.wikimedia.org/T287265
[16:46:34] thank you bd808
[16:47:01] glguy: I rm'd the script and killed the running instance. Hopefully godog will see the task before turning it back on.
[16:49:05] OK, thanks for your help
[16:49:36] feel free to ping me if anything else comes up irc-wise
[16:49:53] thank you for warning us rather than just slamming the door shut glguy
[16:49:55] oops
[16:52:20] majavah: Any reason why you stopped the cbng_relay job when the Phab task was regarding when it was on Freenode and there has been no complaints since Jess last commented in May?
[16:52:29] (At least, no complaints to me)
[16:52:43] we just got a complaint from Libera staff a few minutes ago
[16:52:57] yeah, glguy was just here poking about it being too noisy and near tipping us into a k-line
[16:53:03] that same address handles CBNGRelay, which seems to have a really hard time staying connected. Does that bot do anything?
[16:53:36] RichSmith: set it up with SASL and turn it back on :) T278584
[16:53:36] T278584: Promote use of SASL for Cloud VPS/Toolforge hosted Libera.chat / Freenode IRC bots - https://phabricator.wikimedia.org/T278584
[16:55:29] Ok, I'll leave it off for now... I had told a staffer before that if it doesn't get voice for whatever reason, just punt it off the network and it should rejoin properly. bd808: I will absolutely look in to that
[16:57:18] RichSmith: sorry, I was still typing a longer reply in phabricator explaining that :/
[16:57:32] majavah: Ah, righto
[18:18:47] I’m trying to work with toolforge-jobs via the API… any idea how I could debug “no pods were created for this job”?
[18:20:12] I suspect it’s because I requested 7Gi memory – I feel like the job might still be pending somewhere in kubernetes space, but I don’t know where
[18:20:31] which tool? let me have a look
[18:20:37] wd-shex-infer
[18:21:26] aha, `kubectl get jobs` has it
[18:21:36] Warning FailedCreate 9m44s job-controller Error creating: pods "wd-shex-infer-83-ndfm4" is forbidden: maximum memory usage per Container is 4Gi, but limit is 7Gi
[18:21:56] maximum memory usage per Container is 4Gi? :/
[18:22:16] looks like yes? `kubectl describe limits`
[18:22:31] on the Grid that tool requests 8g and I assume I arrived at that number by doubling it until it worked…
[18:22:47] (though that would’ve been two or three years ago so I’m not 100% sure)
[18:23:18] `kubectl quota` reported 8Gi as the limit so I thought 7Gi might still work
[18:23:29] but I can try 4Gi for now and see if it at least starts. thanks so far!
[18:23:44] `quota` is total for usage on all resources on the namespace, `limits` are per-pod limits
[18:24:02] err- per container in this case
[18:25:26] see https://phabricator.wikimedia.org/T286784 about adjusting the limits for toolforge-jobs
[18:27:03] ok, that got me as far as pods being created
[18:27:26] now I need to figure out why they’re erroring ^^ I’ll shout if I can’t figure it otu
[18:27:29] *out
[18:28:13] `kubectl describe ` and `kubectl get events` are your friends when troubleshooting
[18:32:19] thanks, kubectl get events looks very helpful there
[18:33:53] (ideally the jobs framework would catch those errors so you don't have to know that)
[18:39:27] filed T287275, feel free to file others
[18:39:28] T287275: toolforge-jobs: reject jobs with more resource requests than single pods can use - https://phabricator.wikimedia.org/T287275
[18:46:31] ok, so I can create running jobs, now my problem is just that no available container has all the programs I need for the house-of-cards pipeline that this tool wraps :D
[18:47:15] tf-buster-std had no `make` – I managed to find an image which has it (tf-python37) but that’s still missing `lsof` and also `java`
[18:47:26] maybe this tool isn’t a great candidate for beta-testing toolforge-jobs after all
[18:48:53] if you need java, try the java container
[18:49:06] well yes but that one doesn’t even have make anymore
[18:49:15] (at least the jdk11 one, haven’t looked at the jdk8)
[18:49:25] * majavah wonders why the python container has `make`
[18:50:00] I would probably have to rearrange the tool significantly in order to make it work
[18:50:05] or, more likely, sunset it
[18:51:10] yeah, there are plans for sort-of bring-your-own-tools-inside-an-image (buildpacks), but that still has a fairly long way to come to reality
[19:06:46] do you need `make` at runtime?
[19:06:59] or can you build the tool first and then just deploy it into the java container?
[19:07:57] well, the pipeline that the tool wraps was written as a Makefile
[19:08:13] could probably be rewritten as a shell script or something else, I’m just not sure the effort is justified
[19:08:18] since the tool isn’t used very often
[21:18:15] legoktm: I've come up with much stranger names fully awake
[21:18:26] haha
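
Regarding the 18:21 to 18:39 exchange about `quota` versus `limits`: a minimal Python sketch of the kind of pre-flight check T287275 asks for, where a memory request is compared against the per-container limit before the job is submitted. This is only an illustration and not toolforge-jobs code. The function names `parse_memory` and `check_memory_request` are hypothetical, the parser handles only a subset of Kubernetes quantity suffixes, and the 4Gi and 8Gi figures are the ones quoted in the log.

```python
import re

# Multipliers for a subset of Kubernetes quantity suffixes.
# Assumption: plain numbers with these suffixes are enough for this sketch;
# exponent notation and milli-units are deliberately not handled.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

_QUANTITY_RE = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def parse_memory(quantity):
    """Convert a quantity string such as '7Gi' into a number of bytes."""
    match = _QUANTITY_RE.match(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * _SUFFIXES[suffix or ""])


def check_memory_request(requested, container_limit, namespace_quota,
                         namespace_used="0"):
    """Return a list of human-readable problems; an empty list means OK.

    container_limit is the per-container maximum (`limits` in the log),
    namespace_quota is the total for the whole namespace (`quota` in the log).
    """
    problems = []
    req = parse_memory(requested)
    if req > parse_memory(container_limit):
        problems.append(
            f"requested {requested} exceeds the per-container limit "
            f"of {container_limit}"
        )
    if parse_memory(namespace_used) + req > parse_memory(namespace_quota):
        problems.append(
            f"requested {requested} on top of {namespace_used} in use "
            f"exceeds the namespace quota of {namespace_quota}"
        )
    return problems


if __name__ == "__main__":
    # The case from the log: 7Gi fits under the 8Gi quota but not the 4Gi limit.
    for problem in check_memory_request("7Gi", "4Gi", "8Gi"):
        print(problem)
```

With the values from the log, the 7Gi request passes the namespace quota check but fails the per-container check, which matches the `FailedCreate` event shown at 18:21:36.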