[08:33:13] serviceops, Datacenter-Switchover, Patch-For-Review: sre.discovery.datacenter should handle depooled authdns hosts - https://phabricator.wikimedia.org/T375285#10170238 (Volans) If this is not super urgent, do you think it could wait an "upstream" solution in spicerack as discussed in T375014?
[09:28:28] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477 (jnuche) NEW
[09:32:54] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170425 (jnuche) Note that the main stage that deploys to K8s normally takes around 6 minutes. Assuming a factor 10 in the time increase, that would mean an hour just to deploy that...
[09:33:34] serviceops, Observability-Alerting, SRE Observability (FY2024/2025-Q1): Retire mw_wikiversion_difference check - https://phabricator.wikimedia.org/T374860#10170426 (fgiunchedi) Open→Resolved a:fgiunchedi Nice! Thank you @hnowlan, resolving as we're done
[09:43:26] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170459 (akosiaris) > This increase in deployment times coincides with the deployment to production of the following scap change: https://gitlab.wikimedia.org/repos/releng/scap/-/me...
[09:51:25] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170512 (akosiaris) {F57533835} I see some evictions happening during the deployment that could explain this, trying to correlate.
[09:56:58] serviceops, MW-on-K8s, Release-Engineering-Team, Scap: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes - https://phabricator.wikimedia.org/T341441#10170569 (hashar) If compression / CPU is a bottleneck, and assume the images layers are already compressed, I imagine...
[10:01:47] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170585 (jnuche) It seems timeouts are currently hardcoded: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/mw-ap...
[10:02:18] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170587 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jiji@cumin1002 from mw2426 to wikikube-worker2126 completed: - mw2426 (**PASS**) - ✔️...
[10:05:32] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170588 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jiji@cumin1002 Renumbering for host wikikube-worker2126.codfw.wmnet
[10:05:53] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker2126.codfw.wmnet with OS bullseye
[10:07:51] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170591 (akosiaris) https://logstash.wikimedia.org/goto/69fa724990f8f554ac97601360675c79 points out that the slowest image pull was at 2m27s, most were way faster. webserver ones at...
[10:08:31] jnuche: wanna try once more a deployment ?
[10:08:42] I am trying to see why that would break
[10:10:30] akosiaris: restarted, full image building normally takes a few minutes, so it will be a bit before it gets to the deployment stage
[10:10:56] cool, thanks
[10:31:50] canaries started
[10:32:21] cool
[10:34:36] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170629 (akosiaris) >>! In T375477#10170585, @jnuche wrote: > It seems timeouts are currently hardcoded: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/...
[10:42:00] timed out at the canaries, it's starting the rollback now
[10:45:14] ok, I 'll craft the change and upload it to restart all canaries together
[10:46:05] ack, thx
[10:57:26] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170697 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host wikikube-worker2126.codfw.wmnet with OS bullseye completed: -...
[11:38:59] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170828 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jiji@cumin1002 Renumbering for host wikikube-worker2126.codfw.wmnet completed: - w...
[11:44:10] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10170848 (JMeybohm)
[12:13:41] serviceops, Data-Engineering, Data-Platform-SRE, SRE, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10170935 (JMeybohm)
[12:13:42] serviceops, Prod-Kubernetes, Discovery-Search (Current work), Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10170936 (JMeybohm)
[12:17:53] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10170952 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jiji@cumin1002 from mw2427 to wikikube-worker2127 completed: - mw2427 (**WARN**) - ✔️...
[12:22:23] serviceops, MW-on-K8s, Scap: Helm deployment timeouts during train presync - https://phabricator.wikimedia.org/T375477#10170975 (akosiaris) Adding 1 more data point. In the previous deployment I see also ` Liveness probe failed: command "/usr/bin/test -S /run/shared/fpm-www.sock" timed out ` for...
[12:25:13] hi folks!
[12:25:23] going to deploy tegola for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1073818
[12:28:38] jnuche: I also found a state management bug in scap (it doesn't also rollback the -main versions, filing a task)
[12:29:01] ruling out btw that something is indeed wrong with the deployment
[12:29:12] more like I am still trying to rule out*
[12:30:11] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10171010 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jiji@cumin1002 from mw2427 to wikikube-worker2127 completed: - mw2427 (**PASS**) - ✔️...
[12:30:20] akosiaris: ack, thanks a lot for following up
[12:32:31] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10171026 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jiji@cumin1002 Renumbering for host wikikube-worker2127.codfw.wmnet
[12:32:55] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10171031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker2127.codfw.wmnet with OS bullseye
[12:34:37] task here: https://phabricator.wikimedia.org/T375497
[12:43:50] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10171171 (Trizek-WMF)
[12:55:02] jnuche: patch up at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075221, asking the team for objections right now
[13:08:00] jnuche patch merged, you should be good to go
[13:09:10] akosiaris: awesome, thx.
Backport window is currently happening, will try again once they're done
[13:09:18] serviceops, MW-on-K8s, Scap: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts - https://phabricator.wikimedia.org/T366778#10171326 (akosiaris)
[13:10:30] jnuche: saw it. It already picked up the change. Lucky timing
[15:38:50] serviceops, MW-on-K8s, Release-Engineering-Team (Priority Backlog 📥): Provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994#10172077 (Krinkle)
[15:38:57] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10172078 (Krinkle)
[15:49:21] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10172134 (ops-monitoring-bot) swfrench@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover - T37096...
[16:12:24] !log Deploying Refinery
[16:13:26] Sorry! Wrong place
[16:19:20] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398#10172251 (Jhancock.wm) a:Jhancock.wm wrong ticket
[17:30:48] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10172473 (Scott_French) The `swift-https` service has three associated discovery services: * `swift` (A/A) * `swift-rw` (A/A) * `swift-ro` (A/P) This is i...
[17:37:29] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10172491 (CDanis) As discussed at the k8s SIG today: * There were some doubts about if Calico can advertise just a subset of Service ClusterIP range, or if it was all-or-not...
[17:56:19] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10172531 (Scott_French) I've opened T375544 to investigate the logstash issue referenced in T370962#10172304.
[19:13:55] serviceops, Datacenter-Switchover, Patch-For-Review: sre.discovery.datacenter should handle depooled authdns hosts - https://phabricator.wikimedia.org/T375285#10172796 (Scott_French) @Volans - Yes, in fact that would be ideal. I went ahead and drafted https://gerrit.wikimedia.org/r/1074551 mainly to...
[20:58:34] serviceops, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10173118 (Jclark-ctr) @jijiki please do update to preseed.yaml, and site.pp when you can we have received these and can not move forward until that step is completed.
[22:53:01] serviceops, Citoid: citoid having stability issues - https://phabricator.wikimedia.org/T330768#10173616 (Jdforrester-WMF) Whilst poking around open alerts, I noticed that both [[https://alerts.wikimedia.org/?q=%40state%3Dactive&q=namespace%3Dcitoid|citoid]] and [[https://alerts.wikimedia.org/?q=%40state%...