[05:29:14] 10serviceops, 10SRE, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Joe) p:05Medium→03High So, this problem clearly hasn't been solved. We need to isolate where the problem is; the easiest way to test is imho as follows: * Tem...
[07:39:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: kube-apiserver need to reach webhooks running inside of the cluster - https://phabricator.wikimedia.org/T290967 (10JMeybohm)
[09:43:50] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966 (10JMeybohm)
[13:24:52] nemo-yiannis and I are trying to run a cronjob
[13:25:59] and we are failing miserably
[13:28:49] So we have an issue with running a cronjob on Kubernetes for tegola. For a long time, jobs were failing on staging because of a connection error that we didn't notice. Now, after deploying a fix (pointing the cronjob's DB connections at envoy instead of connecting directly to the DB), the new jobs are not scheduled because of this error: https://phabricator.wikimedia.org/P17463
[13:30:06] I assume that the issue here is that it was failing more than the allowed threshold (as in weeks) and the controller doesn't let new jobs start.
[13:30:46] Any ideas?
[13:42:32] <_joe_> nemo-yiannis: not without looking at logs, no
[13:46:34] <_joe_> so I don't see any cronjob, not even failed, in staging
[13:51:20] _joe_: I deleted the cronjob
[13:51:22] that is why
[13:52:01] <_joe_> effie: ok I just wasted the last 10 minutes trying to understand what I did wrong
[13:52:26] <_joe_> effie: it's working now?
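[Editor's note: for context on the "controller doesn't let new jobs start" symptom above — the Kubernetes CronJob controller refuses to schedule a job when more than 100 start times have been missed since the last run and no `.spec.startingDeadlineSeconds` is set. A hedged sketch of a manifest fragment follows; the name, schedule, image, and command are hypothetical, not tegola's actual values.]

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tegola-pregeneration        # hypothetical name
spec:
  schedule: "0 * * * *"
  # Without this, the controller counts every missed run since the last
  # successful schedule; past 100 misses it gives up ("too many missed
  # start time") and stops creating new Jobs. Setting a deadline bounds
  # the window it considers.
  startingDeadlineSeconds: 600
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: example/tegola:latest   # placeholder image
              command: ["/srv/run-job.sh"]   # placeholder command
```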
[13:52:29] oh sorry about that, I deleted it exactly when you started looking apparently
[13:52:43] very sorry
[13:52:50] nemo-yiannis will tell us :p
[13:53:06] waiting for the next run
[13:55:13] yeah it ran
[13:59:58] it looks like the fix we pushed for the cronjob to terminate envoy on exit also worked
[14:01:18] woew
[14:01:21] wow
[14:01:28] you can thank jayme for this ^
[14:03:29] <_joe_> effie: it's amazing jayme and I came to the same suggestion regarding terminating envoy in a cronjob
[14:03:33] <_joe_> evil minds think alike
[14:03:46] <_joe_> did you also use echo | /dev/tcp ?
[14:04:03] no, it was a plain curl
[14:04:04] <_joe_> that gave bd808's cron that 80's touch
[14:04:10] hahahahaha
[14:04:10] thank you all!
[14:04:12] <_joe_> booo you have curl in the container
[14:04:23] <_joe_> that's lavish
[14:04:23] I can live with curl in a container
[14:04:27] there is no shame in curls
[14:07:29] _joe_: there are no shameful ways to make a request, just shameful people
[14:09:00] <_joe_> effie: your containers have too much blubber
[14:09:19] oh don't you blubber this conversation, that's a low blow
[14:09:29] <_joe_> I was fat shaming your container, yes
[14:09:54] * _joe_ mediawiki's containers enter the room
[14:10:29] I dare you to count the number of containers in a mediawiki pod
[14:11:00] <_joe_> effie: we had a 77 GB container once
[14:11:18] yeah I saw that, it must be some sort of a record
[14:32:25] _joe_: just trying to earn my neckbeard stripes with esoteric bashisms
[14:33:07] I guess /dev/tcp is a linux'ism though and not just a bash hack
[14:36:28] <_joe_> I don't remember if it worked on older unices
[14:36:52] I doubt I ever tried it on other unices
[14:38:52] it seems like a thing that would have come from the BSD tree, but that's just guessing
[14:42:03] if I were to guess, I would say it is a bash-ism
[14:42:20] I may look it up later out of curiosity
[14:51:58] Looks like it is a bash'ism, or at least bash has special support for it --
https://www.gnu.org/software/bash/manual/html_node/Redirections.html -- "Bash handles several filenames specially when they are used in redirections, as described in the following table. If the operating system on which Bash is running provides these special files, bash will use them; otherwise it will emulate them internally with the behavior described below."
[15:06:04] The /dev/tcp thing sounds Plan 9-inspired to me
[15:06:25] (but I don't know where the idea originated for sure)
[15:06:46] Plan 9 was strong on the everything-via-filesystem-abstractions stuff (including network sockets)
[15:07:55] P9 had it under /net though, apparently: http://man.cat-v.org/plan_9/3/ip
[16:52:18] 10serviceops, 10Shellbox, 10User-brennen, 10Wikimedia-production-error: Shellbox\ShellboxError: Shellbox server returned status code 503 - https://phabricator.wikimedia.org/T292663 (10dancy) We get a handful of these errors every day (for example 19 in the last 24 hours). There don't seem to be any corres...
[17:20:09] I think I need someone with more perms in the eqiad k8s cluster to clean up something for me. I'm trying to get Toolhub's CronJob to terminate properly (the /quitquitquit envoy trick), but I think the prior "stuck" jobs (job.batch/toolhub-main-crawler-1633129200, job.batch/toolhub-main-crawler-1634058000) may need to be deleted.
[17:23:44] effie: ^ is that something you could help me with or point me to a better person to bother?
[17:36:31] <_joe_> bd808: anyone with global root should be able to; but I'll do it
[17:38:46] <_joe_> bd808: where is that atm?
[17:39:00] <_joe_> which cluster, I mean
[17:39:04] <_joe_> eqiad? staging?
[17:39:08] _joe_: toolhub namespace, eqiad cluster
[17:40:26] <_joe_> bd808: so what is the problem exactly?
[17:40:39] <_joe_> I see the job was correctly deleted and recreated 39 minutes ago
[17:41:47] <_joe_> toolhub-main-crawler-1634058000 is actually using crawler.sh already?
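[Editor's note: the `/dev/tcp` feature discussed above is indeed bash-internal (per the GNU manual quoted), not a kernel device. A small self-contained demo, assuming bash and python3 are available; the port number is arbitrary:]

```shell
# Start a throwaway local HTTP server to talk to, then speak HTTP to it
# using only bash redirections -- no curl needed.
python3 -m http.server 8099 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1

# Open file descriptor 3 read/write to 127.0.0.1:8099 via the /dev/tcp
# pseudo-file. Bash intercepts this path itself; it never hits the filesystem.
exec 3<>/dev/tcp/127.0.0.1/8099
printf 'GET / HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n' >&3
read -r status <&3          # read back just the status line
exec 3<&- 3>&-              # close the connection
kill "$srv" 2>/dev/null

echo "$status"
```

The same redirection trick is what makes a curl-less "ping the sidecar" possible in a minimal container image.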
[17:42:01] _joe_: Not confirmed, but I was worried that the `kubectl get job.batch` pods that are still running would confuse the scheduler
[17:43:21] <_joe_> I just removed one old job that was marked as not completed
[17:43:29] <_joe_> so I shouldn't remove that job, right?
[17:43:51] _joe_: if you could del job.batch/toolhub-main-crawler-1634058000 as well I think that would do it
[17:45:44] <_joe_> but that job is running
[17:46:02] <_joe_> oh is that an old pod without the use of crawler.sh?
[17:46:05] <_joe_> I see
[17:46:29] <_joe_> sorry, just the usual confusion of looking at describe cronjob instead of describe job
[17:47:17] <_joe_> ok, done
[17:47:44] <_joe_> ok next crawler will start at the hour
[17:48:13] <_joe_> ok, off to dinner!
[17:50:45] bd808: sorry, I just saw this
[17:52:46] no worries effie, and thanks _joe_
[18:11:53] grrr.. the job.batch/toolhub-main-crawler-1634061600 job did not terminate the envoy container as hoped. I don't know why it did not yet, but I'll poke around in the logs and metadata I can find.
[18:24:31] 10serviceops, 10SRE, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10Legoktm) >>! In T292646#7419245, @jijiki wrote: > The rollout process has to stay as it is though (upgrade on canaries first, and roll out to all hosts after 1-2 days)...
[18:59:45] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog: Tegola cronjob cannot connect to kafka-main cluster - https://phabricator.wikimedia.org/T293134 (10Jgiannelos)
[19:13:32] 10serviceops, 10SRE, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10jijiki) @Legoktm we may debdeploy scap everywhere, and then for whatever reason we need to push change Y fast due to issue X. If scap fails everywhere because of a bug w...
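[Editor's note: the "/quitquitquit envoy trick" referenced throughout is Envoy's admin-interface shutdown endpoint: the main container POSTs to the sidecar's admin port after its work finishes, so the sidecar exits and the Job pod can reach Completed. A hedged sketch of the container command; the script path and admin port 15000 are assumptions, not confirmed from the toolhub chart:]

```yaml
# Fragment of a Job/CronJob container spec (hypothetical paths/port)
command:
  - /bin/sh
  - -c
  - |
    /srv/crawler.sh; rc=$?
    # Ask the envoy sidecar to shut down; without this the sidecar keeps
    # running and the Job never terminates.
    curl -sf -X POST http://127.0.0.1:15000/quitquitquit || true
    exit $rc
```

Capturing and re-raising the script's exit code (`rc`) matters: otherwise a failed crawl would be masked by a successful curl and the Job would report success.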
[19:28:21] 10serviceops, 10Shellbox, 10User-brennen, 10Wikimedia-production-error: Shellbox\ShellboxError: Shellbox server returned status code 503 - https://phabricator.wikimedia.org/T292663 (10Legoktm) Something must have regressed with envoy, I'll need to look into those logs.
[22:17:46] Anybody know where I should be looking to find something between the open internet and my toolhub service in the eqiad k8s cluster that would be blocking a PATCH verb? Maybe the CDN edge or Envoy?
[22:19:39] The call I'm making is to PATCH /api/lists/1/feature/ on https://toolhub.wikimedia.org. This is returning a 405 method not allowed response and a "Wikimedia Error" page. It's not reaching my service at all.
[22:21:39] well the good news is, we know it's not envoy because we haven't set up custom error pages :D
[22:21:42] sounds like CDN edge, there is return (synth(405, "Method not allowed")); in varnish/templates/wikimedia-frontend.vcl.erb
[22:21:45] Varnish would be my guess
[22:21:51] aha yeah
[22:22:43] bam. nice find mutante
[22:22:53] "allowed_methods", "^(GET|HEAD|POST|OPTIONS|PURGE)$
[22:23:00] so... this is not awesome
[22:23:24] I need PATCH and DELETE for my API
[22:25:14] seems like some custom VCL could theoretically allow another method there if req.http.host == toolhub, but that would be a request to traffic team
[22:26:39] yeah. Carve outs per service is icky, but maybe the only hack here. I'll open a task and see what the traffic folks think.
[22:26:59] sounds good
[22:27:11] yeah, but on the other hand I'm guessing the only reason it's blocked is that we've never needed it before
[22:27:32] so a per-service carve-out is the less bold alternative to just un-blocking it globally
[22:29:43] right, I guess we should be like "hey, traffic, is allowed_methods still current"
[22:30:24] yeah sounds right -- but if that turns into a longer conversation, the req.http.host carveout probably unblocks bd808 in the meantime
[22:30:35] both plans sound good to me 👍
[22:31:36] hmmm... actually this must be some other thing blocking because I can use PUT verbs and that is not allowed by that block either.
[22:31:44] https://en.wikipedia.org/wiki/Patch_verb#Caution
[22:32:39] * bd808 looks in misc-frontend.inc.vcl.erb to see if there is something similar
[22:34:35] unless maybe `@vcl_config.fetch("allowed_methods", "^(GET|HEAD|POST|OPTIONS|PURGE)$")` has a fetched value that is different from the default there?
[22:34:53] hieradata/cloud/eqiad1.yaml: allowed_methods: '^(GET|HEAD|OPTIONS|POST|PURGE|PUT|DELETE)$'
[22:34:57] :)
[22:35:01] the PUT is in hiera
[22:35:20] eh
[22:35:20] hieradata/role/common/cache/text.yaml: allowed_methods: '^(GET|HEAD|OPTIONS|POST|PURGE|PUT|DELETE)$'
[22:35:33] bd808: yea, that
[22:35:33] PUT and DELETE, ok. So it's just PATCH
[22:36:03] traffic, why do you hate patch
[22:36:36] is it related to the "caution" section in the WP article? i dunno
[22:36:48] I'd assume the simple "nobody asked yet" :)
[22:37:06] possibly, heh
[23:06:57] Filed as T293157 if anyone wants to comment on the task
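[Editor's note: the two `allowed_methods` values pasted from the log can be checked locally with any ERE-capable grep (the real VCL uses PCRE, but for a plain anchored alternation they agree). This confirms the diagnosis: PUT and DELETE pass the text-cache regex while PATCH fails both it and the default:]

```shell
# The default from wikimedia-frontend.vcl.erb, and the hiera override
# from hieradata/role/common/cache/text.yaml, as quoted above.
default='^(GET|HEAD|POST|OPTIONS|PURGE)$'
text_cache='^(GET|HEAD|OPTIONS|POST|PURGE|PUT|DELETE)$'

allowed() {   # usage: allowed REGEX VERB -> exit 0 if the verb matches
  printf '%s' "$2" | grep -qE "$1"
}

allowed "$text_cache" PUT    && echo "PUT: allowed"
allowed "$text_cache" PATCH  || echo "PATCH: rejected (synth 405)"
allowed "$default"    PATCH  || echo "PATCH: rejected by default too"
```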