[08:51:56] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[08:52:30] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:48:56] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Jelto)
[09:54:40] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[10:02:04] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:04:56] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MatthewVernon)
[10:13:15] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10elukey)
[10:16:23] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10elukey)
[10:18:52] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:30:04] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:30:48] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:33:12] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MatthewVernon)
[11:53:33] btullis: you wanna deploy datahub to staging-eqiad first and check if it actually works?
[11:53:56] Yep, will do.
[11:55:29] cool. lmk and I'll try codfw
[11:55:37] *staging-codfw
[12:03:32] Looking good on staging-eqiad - feel free to go ahead with staging-codfw
[12:34:13] btullis: all good
[12:34:16] 🤦
[12:34:57] Great!
[12:35:07] mem usage is not yet stable, but the container is not constantly using all available CPU time...so I'm optimistic
[12:35:39] another notch in the "I don't like Java that much"-wall
[12:37:40] Yeah, I don't like it either :-) Curiously, my `kubectl logs -f datahub-gms-main-7cbddc564f-fr5wh -c datahub-gms-main` quits out and doesn't follow the logs. "too many open files"
[12:39:16] same here
[12:40:18] ulimit issues?
[12:41:30] kubectl logs -f datahub-gms-main-7cbddc564f-fr5wh -c datahub-gms-main works for me
[12:41:37] it's the same for the frontend but it does not happen on staging-eqiad
[12:41:40] But it won't ctrl-c
[12:41:45] Ah there
[12:43:07] I reckon that memory usage for the GMS container is pretty stable now at ~800 MB
[12:46:08] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-debug$ sudo sysctl -n fs.inotify.max_user_watches
[12:46:10] 8192
[12:46:12] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-debug$ sudo sysctl -n fs.inotify.max_user_instances
[12:46:14] 128
[12:46:16] That's too low
[12:46:31] That's why you have log display issues
[12:46:54] btullis: agreed. Can you double check that datahub is also working as expected in codfw? :)
[12:48:11] claime: but still strange that it works for a container running in staging-eqiad
[12:48:19] but not in staging-codfw
[12:48:20] It's dumping less logs ?
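[Editor's note: the "too many open files" from `kubectl logs -f` and the low `fs.inotify` values quoted above can be cross-checked directly on a host. A minimal diagnostic sketch, assuming only plain Linux `/proc` (no WMF-specific tooling):]

```shell
# Read the current inotify limits (same values `sysctl -n` printed above,
# but readable without root):
cat /proc/sys/fs/inotify/max_user_watches
cat /proc/sys/fs/inotify/max_user_instances

# Count inotify instances currently open across all processes, to see how
# close the host is to max_user_instances (run as root to see everything):
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```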
[12:48:28] it's the exat same thing
[12:48:31] *exact
[12:49:36] :/
[12:59:32] it does not happen for other containers in the cluster or for other containers in the datahub pod as well...so maybe it's an in-container limit that gets hit.
[12:59:42] I'm very much inclined to ignore this
[13:01:01] jayme: Struggling a bit. As foolish as it sounds, I never really got around to setting up a decent testing mechanism against staging-eqiad. I'm trying an SSH tunnel like this, but not getting through to the front end: `ssh -N -L 8501:10.2.1.69:8501 kubestagemaster2001.codfw.wmnet`
[13:01:56] btullis: isn't that thing behind ingress?
[13:02:32] Yeah, I think so. I got the ingress IP from here: k8s-ingress-staging.svc.codfw.wmnet https://netbox.wikimedia.org/ipam/ip-addresses/10019/
[13:02:38] Have I got the wrong port?
[13:03:35] istio still throws me a bit, sorry.
[13:05:07] ignore what I said about logs --follow above, I did the wrong thing. Seems like it happens for all containers on kubestage2001...I remember there was an issue *a long time ago* of kubelet not closing inotify watchers...
[13:06:00] jayme: Yep, that's what it made me think of
[13:06:02] ah, you actually need to reach that thing with a browser right
[13:07:17] jayme: Yeah, that's the only functional test we have so far other than "does it stay up"? Trying again with port 30443, but I'm not sure how I'm going to frig the servername.
[13:08:04] 30443 is the right port. But you will have to trick your browser into sending the right servername for SNI
[13:09:34] OK, thanks. I'll dig out an extension. How about which host I should use as SSH tunnel endpoint? I was trying kubestagemaster2001 - should that be OK?
[13:11:40] I'd use the deploy host, just to be outside the k8s context...
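[Editor's note: the SNI trick discussed above doesn't strictly require a browser extension; `curl --resolve` can send the right servername through a local tunnel. A sketch, assuming an SSH tunnel already forwards local port 30443 to the staging ingress; the service hostname below is hypothetical and stands in for the real ingress vhost:]

```shell
# Hypothetical service name -- substitute the real ingress vhost.
HOST=datahub-frontend.k8s-staging.example.wmnet

# First forward the ingress HTTPS port through the deploy host, e.g.:
#   ssh -N -L 30443:k8s-ingress-staging.svc.codfw.wmnet:30443 <deploy host>

# --resolve skips DNS and connects to the tunnel, while curl still sends
# $HOST for SNI and the Host header; -k because the certificate won't
# chain to anything the local system trusts under that name.
curl -vk --resolve "${HOST}:30443:127.0.0.1" "https://${HOST}:30443/"
```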
[13:11:50] but anything should work really
[13:12:25] root@kubestage2001:~# systemctl restart kubelet.service
[13:12:26] Failed to allocate directory watch: Too many open files
[13:12:31] this looks promising :D
[13:13:03] Yeah we need to raise the inotify limits imo
[13:20:44] yes. I'll add a patch raising those to what we use on prometheus nodes (in absence of an idea for proper numbers)
[13:20:56] 'fs.inotify.max_user_watches' => 32768,
[13:20:57] 'fs.inotify.max_user_instances' => 512
[13:22:51] We can try that, and bump up if it's not enough
[13:26:05] btullis: I've raised the max_user_instances temporarily on kubestage2001. logs --follow for datahub works again
[13:26:06] There's no real issue with raising these limits to like 8k instances / 1M watches (I've done it before), it's 1 kB per *used* watch out of kernel memory
[13:46:25] jayme: Excellent, thanks. Also confirmed working from browser, with a frigged local `/etc/hosts`
[13:46:29] https://usercontent.irccloud-cdn.com/file/PybUeVSj/image.png
[13:56:37] 👌
[13:56:53] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (10BTullis)
[14:39:50] Deploying the new datahub containers to wikikube, for the record.
[14:39:57] cool, thanks
[16:27:47] btullis: I'm just going over other JRE-based containers that would potentially need a version bump to work with k8s 1.23 clusters and found https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/864770
[16:27:57] looks like that has been forgotten maybe?
[16:29:45] jayme: Yes, I probably forgot to submit it. I was hoping to ping you about spark next week, all things being well :-)
[16:31:50] It should be OK just to merge that and build, right? (On Monday'ish) It's not going to impact anyone else, is it?
[16:33:34] yeah, that should be fine. It would only impact users of the :latest tag for that image...
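[Editor's note: the limit values pasted above are puppet hash syntax; on a plain host they correspond to a sysctl drop-in. A sketch of what the change effectively ships, with the drop-in path and filename chosen for illustration; requires root, shown here as a config fragment rather than a tested command:]

```shell
# /etc/sysctl.d/99-inotify.conf (illustrative filename)
# Values matching the prometheus nodes, as discussed above.
# Raising them is cheap: kernel memory is consumed per *used* watch
# (~1 kB each), not per configured maximum.
printf '%s\n' \
  'fs.inotify.max_user_watches = 32768' \
  'fs.inotify.max_user_instances = 512' \
  | sudo tee /etc/sysctl.d/99-inotify.conf

# Apply without rebooting:
sudo sysctl --system
```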
[16:33:45] good luck with pinging me about spark :-p
[16:35:45] srsly...sorry that it took so long. :/ I hope I can get back to reviewing it after staging-eqiad is on 1.23
[16:43:50] Ah, it's fine. It's not like either of us were twiddling our thumbs.
[19:34:00] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:34:20] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05Open→03In progress p:05Triage→03Medium
[19:34:50] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:35:04] 10serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus)
[20:13:39] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF) 05Open→03Resolved
[20:14:51] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 07): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF) 05Open→03Resolved
[20:15:01] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF)
[20:15:11] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 07): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF) 05Resolved→03Open
[20:15:18] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF)
[20:16:05] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream: k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF)
[23:11:13] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05In progress→03Resolved
[23:11:26] 10serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus)