[07:14:24] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi) Open→Resolved a:ayounsi
[08:13:18] Hi Folks. I just noticed job/service Prometheus probes in codfw are flapping a bit since the upgrade to 1.23. It seems there are significantly more "context deadline exceeded" than before. I noticed this for miscweb, but other Kubernetes services are affected too:
[08:13:19] https://logstash.wikimedia.org/goto/c445d50124df1dcd85739700a26fd9bc
[08:13:19] I can also add this to the upgrade task, if needed.
[08:33:49] jelto: hi! I think that we can add all the info the task so it is clear, I found another issue
[08:33:53] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[08:34:05] istio metrics may have changed, I don't see them in the gateway dashboard
[08:34:15] (same thing happened for the sidecar ones, that only ml uses)
[08:36:42] elukey: yeah makes sense. Do you mean T307943?
[08:53:23] jelto: we can probably use https://phabricator.wikimedia.org/T329664 that is more specific
[08:55:34] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (elukey) I noticed that the [[ https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&va...
[08:57:25] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (elukey)
[08:57:27] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (Jelto) I just noticed job/service Prometheus probes in codfw are flapping a bit since the upgrade to 1.23. It seems there are significantly more "context deadl...
[08:57:40] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: K8s etcd on bullseye show TLS errors in logs - https://phabricator.wikimedia.org/T329556 (elukey) Open→Resolved a:elukey
[08:57:57] o/
[08:58:26] Got an APP meeting in 3m, please do add things under that task or create sub tasks and we 'll get on them on by one.
[09:45:05] akosiaris, claime - I saw that https://phabricator.wikimedia.org/T330048 was resolved, should we reimage the remaining nodes?
[09:45:16] I can try one and see if it now works
[09:45:29] yeah sure, thank you
[09:45:30] basically 2017 -> 2021
[09:45:38] ok trying 2017
[09:45:38] elukey: go ahead. I 've been meaning to do it if nobody else does
[09:45:48] but 3 hours of meetings back to back
[09:46:03] I 'll be able to help at ~2pm UTC ?
[09:46:28] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2017.codfw.wmnet with OS bullseye
[09:46:49] akosiaris: ack!
[09:47:03] serviceops, Release-Engineering-Team: scap build-and-push-container-images failed due to backend fetch error - https://phabricator.wikimedia.org/T330264 (hashar) Open→Declined I have filed it to keep track of the issue in case it occurs again in the future.
[09:47:06] claime: can I ping you if I need help in double checking the nodes after the reimages?
[09:47:11] I'm gonna go shorten the TTLs for the live test at 1100
[09:47:25] elukey: I'm running the switchover live-test today, I may not be very available
[09:47:58] I'll be around though, but not sure of how usefule
[09:48:00] useful*
[09:48:16] ah right no problem, at what time?
[09:48:21] so I avoid messing up with you
[09:48:30] 1100GMT
[09:49:16] ah ok in an hour
[09:49:18] But you won't be messing with me, I'm taking stuff out of codfw, and we already know A/A services move well
[09:49:31] super so I can do the reimages
[09:49:35] I'm focusing on DB/mw
[09:49:40] yes, absolutely
[09:49:44] <3
[10:04:14] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye
[10:04:34] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye
[10:05:02] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye
[10:05:28] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye
[10:05:41] started all the reimages, 2017 seems running fine
[10:08:34] elukey: <3
[10:21:12] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2017.codfw.wmnet with OS bullseye executed with errors: - kubernetes2017 (**F...
[10:25:24] 2017 up :)
[10:26:47] We'll have to check something with eventstreams-internal, there are PyBal alerts for a bunch of kubernetes codfw nodes
[10:33:09] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[10:33:28] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Open→In progress p:Triage→High
[10:33:38] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[10:36:11] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[10:36:34] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 10:35 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl 10:35 <+logmsgbot> !log cgoubert@cu...
[10:39:39] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye executed with errors: - kubernetes2018 (**F...
[10:41:04] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye executed with errors: - kubernetes2020 (**F...
[10:41:37] serviceops, Datacenter-Switchover: sre.switchdc.mediawiki cookbook should take a task-id argument - https://phabricator.wikimedia.org/T330273 (Clement_Goubert) p:Triage→Medium
[10:43:56] 2018 and 2020 up
[10:45:13] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye completed: - kubernetes2021 (**PASS**) -...
[10:45:40] ok all nodes up
[10:46:04] akosiaris, claime - I have cordoned them, so you can do a final check and add them to prod if all is ok
[10:46:09] going to update the task as well
[10:46:14] thanks!
[10:46:21] thanks elukey <3
[10:46:31] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (elukey)
[10:46:37] I'm going to take a small break before proceeding with the switchover live test
[10:46:52] I'll be skipping the cache-warmup since it's broken anyways
[10:47:04] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (elukey) The 2017->2021 nodes have been reimaged, and they are now cordoned to wait for ServiceOps' final check.
[10:47:23] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye completed: - kubernetes2019 (**PASS**) -...
[10:48:21] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Skipping `00-optional-warmup-caches` as the node script is broken and [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/890299 |...
[11:01:31] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet 11:01 <+logmsgbot> !log cgouber...
[11:02:06] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:01 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks 11:01 <+logmsgbot>...
[11:03:08] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:02 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance 11:02 <+logmsgbot> !log cgoub...
[11:05:10] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:03 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly 11:03 <+logmsgbot> !log cgoubert@...
[11:05:21] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` spicerack.mysql_legacy.MysqlLegacyError: Unable to get heartbeat from master db1118.eqiad.wmnet for section s1 `
[11:12:33] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Error seems to come from cumin query: ` 2023-02-06 12:09:06,872 DRY-RUN cgoubert 2367071 [ERROR _menu.py:261 in run] Exception raised...
[11:15:46] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:13 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki 11:13 <+logmsgbot> !log cgoub...
[11:25:00] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) ` 11:18 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters 11:24 how man...
[12:04:01] claime: how did the live test go ?
[12:04:12] read -ops lol
[12:04:16] But long story short
[12:04:25] it does not reset DC_FROM to rw
[12:04:40] So we had errors for a few minutes until we found out what happened
[12:05:29] ok, but it also shouldn't reset DC_FROM to rw when not in multi-dc situation
[12:05:37] that is for the 1st week
[12:05:41] right ?
[12:06:43] Well apparently having a DC with mwconfig select name=ReadOnly true creates read-only errors for users
[12:06:49] Even if it's codfw
[12:07:18] ofc, we are currently directing traffic there
[12:07:33] ah, mw config
[12:08:02] yeah, that one should be set back to rw, it's the DBs that should remain RO
[12:08:15] yeah
[12:28:33] the extra new k8s nodes appear ok from a cursory look, I 'll make sure when I am back from an errand
[13:42:28] serviceops, Prod-Kubernetes, Scap: scap deploys are taking > 30 minutes due to docker images timing out - https://phabricator.wikimedia.org/T330291 (Ladsgroup)
[14:17:25] serviceops, Prod-Kubernetes, Scap: scap deploys are taking > 30 minutes due to docker images timing out - https://phabricator.wikimedia.org/T330291 (akosiaris) p:Unbreak!→High We added back the lost capacity (4 nodes) in T330048, we should be ok again. Lowering and once we got confirmation, w...
[14:23:32] serviceops, Prod-Kubernetes, Scap: scap deploys are taking > 30 minutes due to docker images timing out - https://phabricator.wikimedia.org/T330291 (Ladsgroup) The broken one: ` 14:15:34 Synchronized wmf-config/core-Permissions.php: Move all of userrights config out of IS.php to a dedicated file, par...
[14:28:20] serviceops, Prod-Kubernetes, Scap: scap deploys are taking > 30 minutes due to docker images timing out - https://phabricator.wikimedia.org/T330291 (akosiaris) Open→Resolved a:akosiaris Cool, I think we fixed this one. Thanks for raising it!
[15:13:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert)
[15:14:00] serviceops, Data-Persistence, SRE, Datacenter-Switchover: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert) p:Triage→High
[15:15:04] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (taavi)
[15:17:58] serviceops, Data-Persistence, SRE, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Clement_Goubert)
[15:18:26] serviceops, Data-Persistence, SRE, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Clement_Goubert) p:Triage→High
[15:18:35] eventstreams-internal fixed
[15:20:12] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert)
[15:20:19] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert)
[15:20:28] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert) Stalled→In progress
[15:20:32] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[15:20:38] akosiaris: awesome, thanks
[15:21:23] serviceops, MW-on-K8s, Datacenter-Switchover: Prepare mw-on-k8s for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327924 (Clement_Goubert) Open→Resolved
[15:21:25] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[15:21:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[15:25:12] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Clement_Goubert)
[15:25:29] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Clement_Goubert) p:Triage→Medium
[15:35:51] serviceops, SRE, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF) Open→Resolved I'll mark this as Resolved. No-one wants to confess to knowing what the remain...
[17:01:01] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (akosiaris)
[17:01:08] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (akosiaris) Open→Resolved The extra hosts have been re-imaged, the cluster has been put back in rotation, serving traffic successfully. I am resolving this. \o/
[17:10:58] serviceops, Data-Persistence (work done), SRE, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Ladsgroup)
[17:11:51] serviceops, Data-Persistence (work done), SRE, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Ladsgroup)
[17:12:33] serviceops, Data-Persistence (work done), SRE, Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (Ladsgroup)
[18:45:47] serviceops, SRE, noc.wikimedia.org: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Dzahn) a:Dzahn→None removing assignee based on automated mail from Andre pointing out it has been assigned...
[19:26:10] serviceops, Prod-Kubernetes, Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (Ottomata) Hi, just FYI, this did cause some issues in the Analytics Cluster. Context here: https://phabricator.wikimedia.org/T330236#8637831 This isn't your fault, more of a desig...
[22:57:32] serviceops, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)