[09:18:20] Morning :0
[09:18:43] akosiaris: Did I miss anything interesting monday and yesterday?
[09:46:06] claime: I don't think so.
[09:46:37] there is 1 question that crept up btw regarding mw-on-k8s
[09:47:00] Is there anything we know we’re waiting on from another team (e.g. releng) ?
[09:47:07] Just making sure we have communicated it
[09:47:37] Not really yet, I've asked them to start testing scap deploys on mw-debug and it apparently works
[09:47:50] Now I have to stand up the rest of the services
[09:48:30] I can put videoscalers aside if needed, I just wanted to have everything ready (re: your comment on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/850095/)
[09:49:00] Users are already created though
[09:49:51] that's lower visibility, I guess it's fine. namespaces tend to be a more clear indication of something being deployed or not
[09:50:01] serviceops, Prod-Kubernetes, Kubernetes: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (elukey)
[09:50:35] I 'd say that for now we can try to clearly communicate that videoscalers aren't gonna be moved to k8s
[09:50:48] and attack that problem later
[09:53:11] ack
[09:55:09] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[09:55:49] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (Clement_Goubert) Open→Stalled Following https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/850095/comment/af29135f_66a53696/ We still hav...
[09:56:25] Moved the task to backlog and marked as stalled for now
[09:56:42] I'll change the CR as soon as I have caught up on emails
[10:32:58] serviceops, Prod-Kubernetes, Kubernetes: Migrate from command line flags to config files for kubernetes components - https://phabricator.wikimedia.org/T300499 (JMeybohm)
[10:34:34] serviceops, Prod-Kubernetes, Kubernetes: Migrate from command line flags to config files for kubernetes components - https://phabricator.wikimedia.org/T300499 (JMeybohm)
[11:13:50] I see a bunch of failures collecting metrics from all k8s clusters btw, starting at around 10:15
[11:14:03] https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&var-datasource=thanos&var-Filters=job%7C%3D~%7C.*k8s.*
[11:14:04] akosiaris: I was wrong, I thought usernames had been merged in hiera but not yet https://gerrit.wikimedia.org/r/c/operations/puppet/+/850094
[11:14:28] my money is on Ieb905505b being the culprit, cc jayme
[11:14:52] looking
[11:16:56] or if you want the prometheus' side of the story: https://prometheus-eqiad.wikimedia.org/k8s/classic/targets
[11:17:00] and hit "unhealthy"
[11:18:20] okaaay...so this is what we are using the unauthenticated read-only port for 😬
[11:18:58] hehee
[11:19:24] lol
[11:23:31] godog: is it desired that the prometheus page is cached?
[11:24:00] jayme: not really no, it should be pass-through
[11:24:08] it's not :)
[11:27:04] heh you are right, I don't know what the magic bits are atm, leaving a note on T301944 for herron
[11:27:13] https://gerrit.wikimedia.org/r/c/operations/puppet/+/852150 if you have a second
[11:27:45] lol re: the comment change
[11:27:56] +1
[11:28:05] <3
[11:30:15] merged and puppet running, should be back to normal in a couple of minutes. Thanks for catching and calling this out so quickly godog!
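Editor's note: the "hit unhealthy" check mentioned at 11:16:56 can also be done against the JSON that Prometheus serves from its `/api/v1/targets` endpoint. The sketch below is only illustrative: it filters a hand-written sample payload rather than querying the real server, and the job names and scrape URLs in it are made up, not taken from the Wikimedia clusters.

```python
import json


def unhealthy_targets(payload):
    """Return (job, scrapeUrl) pairs for active targets whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"))
        for t in targets
        if t.get("health") != "up"
    ]


# Hypothetical sample in the shape of a /api/v1/targets response.
SAMPLE = json.loads("""
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "k8s-api"},
   "scrapeUrl": "https://k8s.example.invalid:6443/metrics",
   "health": "down", "lastError": "connection refused"},
  {"labels": {"job": "k8s-node"},
   "scrapeUrl": "https://node.example.invalid:10250/metrics",
   "health": "up", "lastError": ""}
]}}
""")

for job, url in unhealthy_targets(SAMPLE):
    print(job, url)
```

Against a live server the payload would come from an HTTP GET on that endpoint instead of the inline sample.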
[11:31:09] serviceops, SRE, Traffic, Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez) DCs using the Let's Encrypt cert have the wikifunctions...
[11:31:25] np jayme, I noticed the jobunavailable alerts and that wasn't unexpected
[11:32:45] ah, I now see it fooled me because it's grouped with the aux cluster one
[11:33:16] ...I attributed it to just aux and ignored it :-/
[11:34:02] yeah I can see how that'd happen, not the most intuitive ATM
[11:34:08] going to lunch, bbiab
[11:35:28] well...the "firing: (17)" should have caught my interest I suppose 😇
[11:41:41] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (jbond)
[11:43:49] serviceops, Infrastructure-Foundations, SRE, puppet-compiler, ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (jbond)
[11:44:18] okay, all green again. Will leave for lunch as well
[14:58:07] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (AlexisJazz) The message has changed: >Unauthorized >This server could not verify that you are authorized to access the document y...
[15:30:11] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) that message is provided by deployment-ms-fe03, I've captured one of my requests between ats-be in deployment-cache-up...
[15:49:13] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) digging a little bit on swift logs: ` Nov 2 15:40:41 deployment-ms-fe03 proxy-server: ERROR with Account server 172.1...
[15:53:41] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) well... deployment-ms-be05 and deployment-ms-be06 have been powered off.. I'm assuming because those two are running d...
[16:01:44] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (LSobanski) @Cmjohnson what's the expected ETA for this host? Asking as contint1001 seems to be nearing the end of its life and we'd like to move ahead with the replacement as quick a...
[16:31:22] if anyone has a sec, a quick little resource bump for thumbor's deployment (but also kask which will fail to be deployed in future if this isn't changed) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/852240
[16:36:57] +1ed. Although I don't think it's an issue. We set a minimum memory requirement of 100M cluster wide. IIRC if you specify something below that, you'll get 100M
[16:38:26] +1 with small nit, but too late lol
[16:38:31] (it really is a nit)
[16:38:48] sensible nit :)
[16:39:19] jayme: I just had a failure: `Error creating: pods "thumbor-main-747bcffbb5-z4g9z" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi, maximum memory usage per Pod is 5Gi, but limit is 5683281920]`
[16:40:46] It's not even giving you the minimum and refusing creation then? That's harsh
[16:41:01] Makes some amount of sense, but still
[16:55:47] serviceops, Analytics-Radar, Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (jbond)
[17:00:19] serviceops, SRE, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn)
[17:02:20] serviceops, SRE, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) @LSobanski comment about the incident with contint1001 is at T294276#8357385
[17:03:08] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Dzahn)
[17:03:26] serviceops, SRE, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) I think this is currently blocked on T313830.
[17:11:26] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez)
[17:25:04] serviceops, Release-Engineering-Team, Patch-For-Review: PendingDeprecationWarning on update_version.py - https://phabricator.wikimedia.org/T310133 (TheresNoTime) a:TheresNoTime→None
[18:23:31] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 (Dzahn)
[18:40:39] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (Jclark-ctr) arclamp1001 B1 U40 cableID 23000021 port40
[18:40:53] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (Jclark-ctr)
[18:41:51] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[18:48:30] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (TheresNoTime) p:Triage→High Bumping this one //up// a bit as it's broken our testing of [[ https://en.wikipedia.beta.wmfla...
[20:26:25] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) @TheresNoTime this should be fixed as a side effect of powering the old instances on to be able to add the new instanc...
[20:28:57] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (TheresNoTime) >>! In T321654#8364815, @Vgutierrez wrote: > @TheresNoTime this should be fixed as a side effect of powering the old...
[20:29:33] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) Open→Resolved a:Vgutierrez
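
Editor's note: the thumbor pod rejection quoted at 16:39:19 is consistent with how a Kubernetes LimitRange behaves. Its defaults apply only when a value is left unset, while an explicitly set request below the minimum (or a limit above the maximum) is rejected rather than adjusted. The sketch below replays that arithmetic in plain Python. It is not the admission controller's code, the function name is hypothetical, and the 100Mi and 5Gi bounds are taken from the error message, not from the cluster's actual configuration.

```python
MI = 1024 ** 2  # one mebibyte in bytes
GI = 1024 ** 3  # one gibibyte in bytes


def limit_range_violations(request_bytes, pod_limit_bytes,
                           min_container=100 * MI, max_pod=5 * GI):
    """Return one message per bound that the given values violate."""
    errors = []
    if request_bytes < min_container:
        errors.append(
            f"minimum memory usage per Container is {min_container // MI}Mi, "
            f"but request is {request_bytes // MI}Mi"
        )
    if pod_limit_bytes > max_pod:
        errors.append(
            f"maximum memory usage per Pod is {max_pod // GI}Gi, "
            f"but limit is {pod_limit_bytes}"
        )
    return errors


# Values from the quoted error: a 50Mi request and a 5683281920-byte pod limit.
for message in limit_range_violations(50 * MI, 5683281920):
    print(message)
```

The pod limit of 5683281920 bytes is 5420Mi, which is 300Mi over the 5Gi (5120Mi) ceiling, so both bounds are violated at once.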