[07:23:55] <_joe_> claime: 1) the "mwdebug" dashboard can now pick all mw on k8s deployments 2) both mw-web and mw-api-ext pass httpbb tests, I think we're GTG at 11
[08:36:21] _joe_: Ah I'd just changed it so it could pick up mw-debug, cool
[08:40:36] serviceops, MW-on-K8s, Scap, Release-Engineering-Team (Radar): Deploy MediaWiki images for kubernetes from the deployment servers - https://phabricator.wikimedia.org/T302539 (JMeybohm)
[08:40:48] serviceops, MW-on-K8s, Kubernetes, Patch-For-Review, Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (JMeybohm) Open→Resolved AIUI this is done.
[08:41:50] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (JMeybohm) a:JMeybohm→None Removing myself from assignee as I'm not currently working on this
[08:42:26] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm)
[08:42:32] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[08:42:40] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm) a:JMeybohm→None
[09:13:02] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (dcausse) >>! In T324576#8457473, @Ottomata wrote: > Q for @dcausse and @gmodena. > > I've thu...
[09:54:51] _joe_: I'll go get a coffee and we can start :)
[09:55:08] <_joe_> claime: in one hour ;)
[09:58:25] UTC...
[09:58:34] * claime headdesks
[09:59:05] Welp more time to write my IR for Friday.
[10:58:00] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[11:00:42] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[11:16:01] serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[11:16:53] serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) Tagging other responders to help fill out the incident report.
[12:01:48] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (gmodena) >>! In T324576#8457473, @Ottomata wrote: > I think I'd prefer not to write log files...
[12:46:52] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (akosiaris) >>! In T324576#8463074, @gmodena wrote: >>>! In T324576#8457473, @Ottomata wrote: >...
[14:03:14] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[14:03:55] serviceops, Machine-Learning-Team, Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (JMeybohm) Open→Resolved a:JMeybohm
[14:05:59] serviceops, Prod-Kubernetes, Kubernetes: Move away from system:node RBAC role - https://phabricator.wikimedia.org/T299236 (JMeybohm) PKI prepared the way for this but admin_ng still needs to be adapted to no longer apply the system:node changes to 1.23 clusters
[14:12:15] serviceops, Prod-Kubernetes, Kubernetes: Migrate charts away from deprecated typology annotations - https://phabricator.wikimedia.org/T325066 (JMeybohm)
[14:16:17] jayme: o/ the flink image and helm patches are pretty ready to go. I need to work more on the networking/istio bits of the flink app, and I have some things to try in minikube, but some of that might be easier to work out in k8s staging?
[14:38:18] serviceops, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[14:40:46] serviceops, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) Since this incident was caused by a temporary raise in logging volume, and our response was to scale up...
[14:47:16] serviceops, SRE-OnFire, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Clement_Goubert)
[14:49:41] serviceops, SRE-OnFire, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Clement_Goubert) p:Triage→Medium
[14:50:09] great work folks on mw-on-k8s
[14:50:13] :)
[14:54:41] serviceops, Data-Engineering, SRE-OnFire, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Clement_Goubert)
[14:55:05] serviceops, Data-Engineering, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[14:59:03] serviceops, Data-Engineering, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Ottomata) > reverting to the state before the incident Hm, do we need to revert? I don't mind eith...
[15:06:48] serviceops, Data-Engineering, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) >>! In T324994#8463585, @Ottomata wrote: >> reverting to the state before the inci...
[15:19:16] I'd like to pool thumbor k8s for a few minutes in ~20 minutes. If that goes well I'll leave it pooled for an hour
[15:24:26] huh, thumbor2004 isn't pooled for some reason https://config-master.wikimedia.org/pybal/codfw/thumbor
[15:24:40] serviceops, MW-on-K8s: Better naming for mw-on-k8s pods - https://phabricator.wikimedia.org/T325071 (Clement_Goubert)
[15:25:31] serviceops, MW-on-K8s: Better naming for mw-on-k8s pods - https://phabricator.wikimedia.org/T325071 (Clement_Goubert) p:Triage→Low
[15:26:29] ottomata: Sorry, I did not manage to get around to reviewing those charts yet. Not sure what you mean exactly by "networking/istio bits", but maybe we can discuss some of that on task?
[15:28:10] serviceops, SRE: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (LSobanski)
[15:52:26] serviceops, Infrastructure-Foundations, netbox: Netbox and Redis - https://phabricator.wikimedia.org/T311385 (jijiki) @ayounsi If you wish to use our redis_misc cluster, you can assign a pair/port/db combination here: https://wikitech.wikimedia.org/wiki/Redis. We have 2 pairs (primary-secondary) in e...
[15:55:13] We're serving some thumbor traffic from k8s, only 4 hosts and with a weight of 2
[15:55:21] \o/
[15:57:28] jayme: mostly I removed templates/vendor stuff that i didn't find necessary/useful in minikube yet. there are ingress/egress/mesh vendor templates that we probably need in prod k8s, but I don't understand them enough to incorporate them yet
[16:01:53] ottomata: I can probably check on the charts end of this week. If you could gather requirements you have (like what egress do you need) on task, that would help
[16:02:05] gtg for today, sorry
[16:07:36] serviceops, Data-Persistence (work done), Parsoid (Tracking): nodejs can't connect to mysqld via tcp/localhost any longer (was: mariadb failing on testreduce1001) - https://phabricator.wikimedia.org/T274034 (jijiki) @ssastry do you think we could close this task?
[16:12:13] depooled thumbor - not sure that worked entirely well. Didn't see a lot of traffic hitting the hosts and some probes failed/were slow
[16:12:17] k I will try, the specific egress needs will be per-service though, will try to describe
[16:12:30] ty
[16:21:09] serviceops, Data-Engineering, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Ladsgroup) Hi, The flood of logs is still incoming, the revert of logspam has not been deployed yet...
[16:32:48] riddle me this - I pool some of the kube thumbor hosts at low weight and no requests at all seem to reach the pods: https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=thumbor&var-pod=All&from=now-1h&to=now not much of a bump on the service metrics either, but we start to get probe failures. no reason for
[16:32:54] connectivity issues I can see given that we use lvs with kubesvc hosts all the time
[16:33:11] something wrong with the probes in the service.yaml? Doesn't look like it to me as we implement and pass the same healthchecks
[16:44:25] hnowlan: can you give us more info about where you see probe failures? I'd also check the thumbor access logs to see how many requests come in, the pod graphs may need more traffic to show meaningful bumps
[16:46:40] elukey: looks to me like there aren't any requests showing up in access logs :(
[16:46:50] the probe failures are the generic ones https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=now-30m&to=now
[16:47:36] hnowlan: mmm do the pybal logs show anything useful?
[16:47:53] elukey: heh, I was having difficulty finding them
[16:49:24] should be on lvs1019 in theory
[16:50:35] hnowlan: also https://config-master.wikimedia.org/pybal/eqiad/thumbor shows a lot of enabled: False
[16:54:32] yeah, I disabled everything once things started failing
[16:55:07] ah okok :)
[16:56:35] hmm, no failures of note in the pybal logs
[17:00:08] ok so the hosts were getting traffic from pybal, passing health checks but failing the prometheus probes?
[17:03:18] yep! I at least didn't see any healthcheck failures, and the configured healthchecks pass on those hosts when queried directly. Although I'm not sure they were actually getting traffic from lvs/via the thumbor.svc.codfw.wmnet service
[17:18:29] ah so I was reading https://gerrit.wikimedia.org/r/c/operations/puppet/+/866445, this is a different setup from the rest of the svcs
[17:22:12] hnowlan: see https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2022.50?id=DqEvDIUBDKoS4s0H7dsT
[17:22:57] the time matches with the graph that you showed afaics
[17:23:45] aha... interesting
[17:24:10] I followed https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_(Prometheus)
[17:24:17] the no route to host is a little weird though
[17:25:43] (need to go now, hope it will lead to something :)
[17:26:54] thanks for the help!
[17:29:07] <_joe_> hnowlan: have you added the thumbor pool to profile::lvs::pools in hiera for the k8s workers?
[17:29:28] <_joe_> if not, the host doesn't respond to the network packets for that ip it gets from lvs
[17:32:46] _joe_: aha, I have not
[17:33:13] <_joe_> it came to my mind last night when I went to bed, then I forgot to ask you if you did
[17:33:20] heh
[17:42:24] thanks for pointing that out! https://gerrit.wikimedia.org/r/867681
[17:42:35] for now it can wait until tomorrow, probably too late to be breaking things
[17:43:40] serviceops, Data-Engineering, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) Yes, the commit message of the above changelog makes it very clear it is not to be...
[17:55:30] serviceops, Wikimedia Enterprise, Performance-Team (Radar), affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (Tgr) 7000 rpm is about 300 million per month. Wikistats says we have about that many content-space pa...
[18:01:55] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (colewhite)
[18:30:08] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @JMeybohm re needed ingress and egress. **Ingress**: I don't think we //need// anyt...
[19:14:56] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Dzahn) the state of parse1002 was manually changed in netbox from "active" to "failed" but there was no sync / cookbook run. This meant at next unrelated deco...
[21:06:10] Oops
[21:21:19] serviceops, Continuous-Integration-Infrastructure, serviceops-collab, Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (Dzahn)
[21:22:21] serviceops, Continuous-Integration-Infrastructure, serviceops-collab, Release-Engineering-Team (Seen): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (Dzahn)