[00:12:15] serviceops, Generated Data Platform, Image-Suggestions, SRE, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) >>! In T304891#7823942, @JMeybohm wrote: > We still have those in labs/private `hieradata/common/profile...
[00:12:55] serviceops, Generated Data Platform, Image-Suggestions, SRE, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) >>! In T304891#7823946, @Joe wrote: > * The deployment will be called image-suggestion and use the image...
[09:26:34] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Volans) p:Triage→Medium
[09:30:31] hello, when you have a moment could you please set the priority of the T306162 task, thanks
[10:41:37] Hello. I'm looking into https://phabricator.wikimedia.org/T306181 and trying to find out whether the eventgate-analytics-external service on wikikube is sufficiently resourced in terms of memory and CPU.
[10:42:31] Are there any tips that might help with this, given that we don't have `kubectl top` available? Thanks.
[10:49:09] I'm seeing a few restarts with OOMKilled being stated as the reason when I look at `kubectl describe pod $podname`
[11:44:58] I've found this in Grafana: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=eventgate-analytics-external&var-pod=eventgate-analytics-external-production-764cbd57b7-h4g2m&var-container=All
[12:43:14] btullis: I would have also used the Kubernetes Container Details dashboard. From what I see you have plenty of cpu left but memory limits are a bit tight sometimes. Yesterday an eventgate-analytics-external pod hit max memory and was OOMKilled:
[12:43:14] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=eventgate-analytics-external&var-pod=eventgate-analytics-external-production-764cbd57b7-pfwzz&var-container=All&from=1650487563186&to=1650492535053
[12:43:14] I'm not sure if infrequent OOMKills are an issue for the specific service and this means data loss/no response :)
[12:46:35] All pods in the namespace have a total of 10 restarts in the last 30 days. So this is quite infrequent, to give you some context for the linked task above about backend errors
[14:11:52] jelto: many thanks for this. Yeah, 10 restarts in 30 days doesn't sound like it's frequent enough to be causing either the errors (T306181) or the latency for p90 (T294911). I've made a CR to increase it anyway though: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/785151
[21:08:00] serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[22:30:29] serviceops, GitLab (CI & Job Runners), Patch-For-Review: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 (Dzahn)
[22:46:54] serviceops, GitLab (CI & Job Runners), Patch-For-Review: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 (Dzahn) Had some trouble with privileges for the non-privileged user, apparmor (which is installed by default on bullseye but without the userspace utils etc. See ab...
[22:47:26] serviceops, GitLab (CI & Job Runners), Patch-For-Review: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 (Dzahn) In progress→Resolved
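
Note on the restart check discussed between 10:49:09 and 12:46:35: when `kubectl top` is not available, restart counts and the last termination reason can still be read from the pod status. The following is a minimal sketch, not the method used in the channel. It assumes the input is the JSON produced by `kubectl get pods -n <namespace> -o json`; the function names and the `SAMPLE` data (including pod names) are hypothetical and exist only for illustration. Also note that `restartCount` counts restarts since the pod was created and `lastState` records only the most recent termination, so this cannot reproduce a "last 30 days" figure the way a dashboard can.

```python
import json

def summarize_restarts(pod_list):
    """Return one row per container that has restarted at least once.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            restarts = cs.get("restartCount", 0)
            if restarts == 0:
                continue
            # lastState.terminated is only present if the previous
            # container instance was terminated.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            rows.append({
                "pod": pod_name,
                "container": cs["name"],
                "restarts": restarts,
                "last_reason": terminated.get("reason", "unknown"),
            })
    return rows

def total_restarts(rows):
    """Sum restart counts across all reported containers."""
    return sum(row["restarts"] for row in rows)

# Hypothetical sample data, shaped like a kubectl pod list.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "example-pod-a"},
     "status": {"containerStatuses": [
       {"name": "main", "restartCount": 2,
        "lastState": {"terminated": {"reason": "OOMKilled"}}},
       {"name": "sidecar", "restartCount": 0, "lastState": {}}
     ]}},
    {"metadata": {"name": "example-pod-b"},
     "status": {"containerStatuses": [
       {"name": "main", "restartCount": 1,
        "lastState": {"terminated": {"reason": "Error"}}}
     ]}}
  ]
}
""")

if __name__ == "__main__":
    rows = summarize_restarts(SAMPLE)
    for row in rows:
        print(row["pod"], row["container"], row["restarts"], row["last_reason"])
    print("total restarts:", total_restarts(rows))
```

To try it against a real namespace, one could save the output of `kubectl get pods -n eventgate-analytics-external -o json` to a file and pass the parsed contents to `summarize_restarts` in place of `SAMPLE`.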