[09:28:05] Mornin'
[10:10:03] https://istio.io/latest/blog/2022/istio-has-applied-to-join-the-cncf/
[10:12:29] serviceops, Prod-Kubernetes, Kubernetes: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (elukey) Noteworthy changelogs to read: * https://istio.io/latest/news/releases/1.10.x/announcing-1.10/ ** https://istio.io/latest/blog/2021/discovery-selectors/ * https://ist...
[10:12:54] serviceops, Prod-Kubernetes, Kubernetes: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (elukey)
[10:46:11] Morning all. I'm hoping to merge this change to the spark and spark-operator images today, if possible: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/850244
[10:46:11] Crucially, I have *not* implemented the upstream method of using root privileges to write files into each executor container. I have also included custom entrypoint scripts to remove the nasty bits, as previously advised. Thanks.
[11:03:18] serviceops, MW-on-K8s: Allow absenting profile::kubernetes::deployment_server::services - https://phabricator.wikimedia.org/T322298 (Clement_Goubert)
[11:56:41] serviceops, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez)
[12:16:19] serviceops, MW-on-K8s, Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (Clement_Goubert) In order to cleanup : # Remove mwdebug private puppet resources in `profile::kubernetes::infrastructure_user` and `profile::kubernetes::deployment_server_secrets::service...
[12:21:29] If anyone has the time to do a round of the above task comment and attached CRs so I can make sure I haven't forgotten obvious things that'd be much appreciated
[12:21:42] (pairing optional :p)
[12:45:19] * claime lunch
[13:08:23] serviceops, Prod-Kubernetes, Kubernetes: Migrate from command line flags to config files for kubernetes components - https://phabricator.wikimedia.org/T300499 (JMeybohm)
[13:09:44] serviceops, Continuous-Integration-Infrastructure, SRE, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (hashar)
[13:15:30] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (hashar) Note the contint machines require a public IPv4 address in order to be able to reach out WMCS instances. Currently we have: | fqdn | IPv4 |--|-- | contint1001.wikimedia.org...
[13:17:28] serviceops, Prod-Kubernetes, Kubernetes: Migrate from command line flags to config files for kubernetes components - https://phabricator.wikimedia.org/T300499 (JMeybohm)
[13:39:38] serviceops, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (LSobanski) Thanks for the clarification. Let's start with #serviceops then and see who else we need afterwards.
[13:57:32] serviceops, Discovery-Search, SRE, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) I have checked with traffic, and we can effectively start by removing the trafficserver mapping via https://gerrit.wikimedia.org/r/...
[14:05:41] akosiaris: If I retire search.wikimedia.org from trafficserver today, is that ok with you? I figure we start a one or two week grace period from that time before actually proceeding to the service decom? Phab https://phabricator.wikimedia.org/T316296 and CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/826884 for reference
[14:05:49] serviceops, SRE, Traffic, Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez)
[14:05:58] claime: +1
[14:06:11] akosiaris: cool, thx
[14:06:49] I imagine logging to SAL that I'm sunsetting the domain is the right thing to do
[14:14:37] serviceops, Discovery-Search, SRE, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) Starting 2 week grace period from today, full decom to happen after 2022-11-17
[14:21:28] serviceops, Discovery-Search, SRE, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) Open→In progress
[14:22:20] serviceops, MW-on-K8s, Patch-For-Review: Allow absenting profile::kubernetes::deployment_server::services - https://phabricator.wikimedia.org/T322298 (Clement_Goubert) Open→In progress p:Triage→Medium
[14:22:24] serviceops, MW-on-K8s, Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (Clement_Goubert)
[14:25:18] serviceops, Observability-Tracing: OpenTelemetry Collector puppetized and able to be deployed easily to arbitrary roles - https://phabricator.wikimedia.org/T320565 (Clement_Goubert) Open→In progress
[14:25:20] serviceops, Observability-Tracing: Package OpenTelemetry Collector as a .deb - https://phabricator.wikimedia.org/T320551 (Clement_Goubert)
[15:11:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (elukey)
[15:26:35] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate from command line flags to config files for kubernetes components - https://phabricator.wikimedia.org/T300499 (JMeybohm)
[16:08:05] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (elukey)
[16:25:27] Yet more minor fixes/tweaks for thumbor https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/852953/
[16:25:52] I think we're gonna have issues with the per-pod memory limit if we scale up the number of instances in a single pod, but that's not a bridge that needs to be crossed right now
[16:29:49] hnowlan: You may want to add your service port to https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports I think
[16:32:23] claime: I've been holding off on that because of the question of port to be used - I think it's somewhat unavoidable we'll have to use 8800 though for the mixed-mode migration to work
[16:32:45] hnowlan: ack
[16:32:45] for now there's no harm in updating it though, good point
[16:33:04] Yeah just so someone doesn't try to deploy on that port too
[16:34:43] done
[16:34:51] <3
[18:16:40] serviceops, Dumps-Generation, SRE: conf* host ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[18:17:04] serviceops, Dumps-Generation, SRE, Wikimedia-Incident: conf* host ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[18:20:16] serviceops, Dumps-Generation, SRE, Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[18:22:16] serviceops, Dumps-Generation, SRE, Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[18:31:07] serviceops, Dumps-Generation, SRE, Patch-For-Review, Wikimedia-Incident: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[18:55:54] serviceops, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (calbon) I am moving this ticket to ML's in progress column. @klausman I spoke to Deb. It sounds like the plan currently is for you to do t...
[19:52:48] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (elukey) Not sure if in scope with the task, but we should add monitoring to a metric like https://grafana.wiki...
[20:57:05] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo) >>! In T322360#8368215, @elukey wrote: > Not sure if in scope with the task, but we should add monito...
[20:57:18] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo)
[21:01:48] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo) Adding @daniel as I believe this was the problematic patch, 5b0b54599bfd, but I am not 100% sure, bec...
[21:09:32] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (ArielGlenn) There was a config setting that turned it on for November. See https://gerrit.wikimedia.org/r/c/op...
[21:28:57] serviceops, Dumps-Generation, SRE, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (jcrespo) Thank you, Ariel!