[08:51:56] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[08:52:30] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:48:56] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Jelto)
[09:54:40] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[10:02:04] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:04:56] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MatthewVernon)
[10:13:15] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10elukey)
[10:16:23] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10elukey)
[10:18:52] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:30:04] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:30:48] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:33:12] 10serviceops, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MatthewVernon)
[11:53:33] btullis: you wanna deploy datahub to staging-eqiad first and check if it actually works?
[11:53:56] Yep, will do.
[11:55:29] cool. lmk and I'll try codfw
[11:55:37] *staging-codfw
[12:03:32] Looking good on staging-eqiad - feel free to go ahead with staging-codfw
[12:34:13] btullis: all good
[12:34:16] 🤦
[12:34:57] Great!
[12:35:07] mem usage is not yet stable, but the container is not constantly using all available CPU time...so I'm optimistic
[12:35:39] another notch in the "I don't like Java that much"-wall
[12:37:40] Yeah, I don't like it either :-) Curiously, my `kubectl logs -f datahub-gms-main-7cbddc564f-fr5wh -c datahub-gms-main` quits out and doesn't follow the logs. "too many open files"
[12:39:16] same here
[12:40:18] ulimit issues?
[12:41:30] kubectl logs -f datahub-gms-main-7cbddc564f-fr5wh -c datahub-gms-main works for me
[12:41:37] it's the same for the frontend but it does not happen on staging-eqiad
[12:41:40] But it won't ctrl-c
[12:41:45] Ah there
[12:43:07] I reckon that memory usage for the GMS container is pretty stable now at ~800 MB
[12:46:08] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-debug$ sudo sysctl -n fs.inotify.max_user_watches
[12:46:10] 8192
[12:46:12] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-debug$ sudo sysctl -n fs.inotify.max_user_instances
[12:46:14] 128
[12:46:16] That's too low
[12:46:31] That's why you have log display issues
[12:46:54] btullis: agreed. Can you double check that datahub is also working as expected in codfw? :)
[12:48:11] claime: but still strange that it works for a container running in staging-eqiad
[12:48:19] but not in staging-codfw
[12:48:20] It's dumping less logs ?
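[Editor's note: the "too many open files" from `kubectl logs -f` and the low `fs.inotify` values quoted above can be cross-checked directly on a host. A minimal diagnostic sketch, assuming only plain Linux `/proc` (no WMF-specific tooling):]

```shell
# Read the current inotify limits (same values `sysctl -n` printed above,
# but readable without root):
cat /proc/sys/fs/inotify/max_user_watches
cat /proc/sys/fs/inotify/max_user_instances

# Count inotify instances currently open across all processes, to see how
# close the host is to max_user_instances (run as root to see everything):
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```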
[12:48:28] it's the exat same thing
[12:48:31] *exact
[12:49:36] :/
[12:59:32] it does not happen for other containers in the cluster or for other containers in the datahub pod as well...so maybe it's an in-container limit that gets hit.
[12:59:42] I'm very much inclined to ignore this
[13:01:01] jayme: Struggling a bit. As foolish as it sounds, I never really got around to setting up a decent testing mechanism against staging-eqiad. I'm trying an SSH tunnel like this, but not getting through to the front end: `ssh -N -L 8501:10.2.1.69:8501 kubestagemaster2001.codfw.wmnet`
[13:01:56] btullis: isn't that thing behind ingress?
[13:02:32] Yeah, I think so. I got the ingress IP from here: k8s-ingress-staging.svc.codfw.wmnet https://netbox.wikimedia.org/ipam/ip-addresses/10019/
[13:02:38] Have I got the wrong port?
[13:03:35] istio still throws me a bit, sorry.
[13:05:07] ignore what I said about logs --follow above, I did the wrong thing. Seems like it happens for all containers on kubestage2001...I remember there was an issue *a long time ago* of kubelet not closing inotify watchers...
[13:06:00] jayme: Yep, that's what it made me think of
[13:06:02] ah, you actually need to reach that thing with a browser right
[13:07:17] jayme: Yeah, that's the only functional test we have so far other than "does it stay up"? Trying again with port 30443, but I'm not sure how I'm going to frig the servername.
[13:08:04] 30443 is the right port. But you will have to trick your browser into sending the right servername for SNI
[13:09:34] OK, thanks. I'll dig out an extension. How about which host I should use as SSH tunnel endpoint? I was trying kubestagemaster2001 - should that be OK?
[13:11:40] I'd use the deploy host, just to be outside the k8s context...
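[Editor's note: the SNI trick discussed above doesn't strictly require a browser extension; `curl --resolve` can send the right servername through a local tunnel. A sketch, assuming an SSH tunnel already forwards local port 30443 to the staging ingress; the service hostname below is hypothetical and stands in for the real ingress vhost:]

```shell
# Hypothetical service name -- substitute the real ingress vhost.
HOST=datahub-frontend.k8s-staging.example.wmnet

# First forward the ingress HTTPS port through the deploy host, e.g.:
#   ssh -N -L 30443:k8s-ingress-staging.svc.codfw.wmnet:30443 <deploy host>

# --resolve skips DNS and connects to the tunnel, while curl still sends
# $HOST for SNI and the Host header; -k because the certificate won't
# chain to anything the local system trusts under that name.
curl -vk --resolve "${HOST}:30443:127.0.0.1" "https://${HOST}:30443/"
```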
[13:11:50] but anything should work really
[13:12:25] root@kubestage2001:~# systemctl restart kubelet.service
[13:12:26] Failed to allocate directory watch: Too many open files
[13:12:31] this looks promising :D
[13:13:03] Yeah we need to raise the inotify limits imo
[13:20:44] yes. I'll add a patch raising those to what we use on prometheus nodes (in absence of an idea for proper numbers)
[13:20:56] 'fs.inotify.max_user_watches' => 32768,
[13:20:57] 'fs.inotify.max_user_instances' => 512
[13:22:51] We can try that, and bump up if it's not enough
[13:26:05] btullis: I've raised the max_user_instances temporarily on kubestage2001. logs --follow for datahub works again
[13:26:06] There's no real issue with raising these limits to like 8k instances / 1M watches (I've done it before), it's 1 kB per *used* watch out of kernel memory
[13:46:25] jayme: Excellent, thanks. Also confirmed working from browser, with a frigged local `/etc/hosts`
[13:46:29] https://usercontent.irccloud-cdn.com/file/PybUeVSj/image.png
[13:56:37] 👌
[13:56:53] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (10BTullis)
[14:39:50] Deploying the new datahub containers to wikikube, for the record.
[14:39:57] cool, thanks
[16:27:47] btullis: I'm just going over other JRE-based containers that would potentially need a version bump to work with k8s 1.23 clusters and found https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/864770
[16:27:57] looks like that has been forgotten maybe?
[16:29:45] jayme: Yes, I probably forgot to submit it. I was hoping to ping you about spark next week, all things being well :-)
[16:31:50] It should be OK just to merge that and build, right? (On Monday'ish) It's not going to impact anyone else, is it?
[16:33:34] yeah, that should be fine. It would only impact users of the :latest tag for that image...
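[Editor's note: the limit values pasted above are puppet hash syntax; on a plain host they correspond to a sysctl drop-in. A sketch of what the change effectively ships, with the drop-in path and filename chosen for illustration; requires root, shown here as a config fragment rather than a tested command:]

```shell
# /etc/sysctl.d/99-inotify.conf (illustrative filename)
# Values matching the prometheus nodes, as discussed above.
# Raising them is cheap: kernel memory is consumed per *used* watch
# (~1 kB each), not per configured maximum.
printf '%s\n' \
  'fs.inotify.max_user_watches = 32768' \
  'fs.inotify.max_user_instances = 512' \
  | sudo tee /etc/sysctl.d/99-inotify.conf

# Apply without rebooting:
sudo sysctl --system
```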
[16:33:45] good luck with pinging me about spark :-p
[16:35:45] srsly...sorry that it took so long. :/ I hope I can get back to reviewing it after staging-eqiad is on 1.23
[16:43:50] Ah, it's fine. It's not like either of us were twiddling our thumbs.
[19:34:00] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:34:20] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05Open→03In progress p:05Triage→03Medium
[19:34:50] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:35:04] 10serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus)
[20:13:39] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF) 05Open→03Resolved
[20:14:51] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 07): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF) 05Open→03Resolved
[20:15:01] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF)
[20:15:11] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 07): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF) 05Resolved→03Open
[20:15:18] 10serviceops, 10Data-Engineering, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JArguello-WMF)
[20:16:05] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream: k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (10JArguello-WMF)
[23:11:13] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05In progress→03Resolved
[23:11:26] 10serviceops: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus)