[02:11:22] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn)
[07:47:01] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: etcd cluster reimage strategies to use with the K8s upgrade cookbook - https://phabricator.wikimedia.org/T330060 (10elukey) @JMeybohm did you have to set the cluster's status to `new` by a c...
[11:01:06] Ok I was doing a last check of our appserver clusters
[11:01:24] 4 less parsoid hosts in codfw than in eqiad
[11:01:32] 10 less appservers
[11:01:40] 10 more jobrunners
[11:01:44] 10 more videoscalers
[11:14:04] not surprising, there's 30ish servers in mw2* in role::insetup::serviceops
[11:15:20] <_joe_> 10 more videoscalers?
[11:15:27] <_joe_> ok that doesn't sound right
[11:15:58] <_joe_> and given those 30 servers are replacements, that doesn't really change things in terms of balance
[11:16:10] This is from for cluster in parsoid appserver api_appserver jobrunner videoscaler; do for dc in eqiad codfw; do echo $cluster $dc; sudo confctl select "dc=$dc,cluster=$cluster" get | wc -l; done; done
[11:16:20] So those insetup would not be counted
[11:17:08] <_joe_> claime: uhm something is fishy
[11:17:18] <_joe_> I count 5 more jobrunners as far as servers go
[11:17:33] Let me check one by one
[11:17:49] <_joe_> and 7 less appservers
[11:18:04] Ah wait
[11:18:08] They have 2 services
[11:18:50] <_joe_> yes
[11:18:52] <_joe_> canary :)
[11:18:55] <_joe_> I was about to say
[11:19:04] Also nginx/apache2 for videoscaler
[11:20:14] With a split on hostname | sort | uniq, I have +5 jobrunners, +5 videoscalers, +2 api_appservers, -7 appservers, -4 parsoid
[11:26:53] <_joe_> the only thing slightly worrisome is the -4 parsoids imho
[11:26:59] <_joe_> but not really an issue
[11:34:18] we're at way less than 50% CPU and memory usage in eqiad, so we should be fine even losing 4 hosts (36 cores)
[11:35:27] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:00:57] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:19:10] Would I be safe to do a thumbor redeploy in k8s? Given that we reduced the number of replicas previously etc
[13:11:51] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[13:13:24] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[13:34:11] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[13:57:26] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[13:59:45] hnowlan: the cluster is back to full capacity, so go ahead
[14:08:59] jayme: cool, thanks!
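The host counts discussed above (11:16-11:20) come from counting confctl output lines, which yields one line per (host, service) pair; hosts exposing two services (a canary entry, or nginx and apache2 on videoscalers) are therefore counted twice. A minimal sketch of the deduplicated per-host count claime arrived at, assuming confctl's `get` prints one JSON object per line with the hostname as the first quoted key (adjust the field extraction if the output format differs):

    for cluster in parsoid appserver api_appserver jobrunner videoscaler; do
      for dc in eqiad codfw; do
        # Extract the first quoted field (assumed to be the hostname) and
        # de-duplicate it, so hosts are counted once even with two services.
        hosts=$(sudo confctl select "dc=$dc,cluster=$cluster" get \
          | cut -d'"' -f2 | sort -u | wc -l)
        echo "$cluster $dc: $hosts hosts"
      done
    done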
[14:17:49] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) It happened. The next step, next week: debrief the process.
[14:26:37] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[15:10:21] I found out the list of hosts for a dsh target (used by scap) can be populated automatically from Puppet DB. That was first introduced for the k8s workers in https://gerrit.wikimedia.org/r/c/operations/puppet/+/859466
[15:11:01] I have proposed a series of changes to rely on Puppet DB queries instead of a manually maintained list of hosts. That would slightly simplify the hieradata copy-pasting and ensure the list of targets is consistent
[15:11:13] no rush, it is merely an improvement =)
[15:38:41] 10serviceops, 10Citoid: citoid having stability issues - https://phabricator.wikimedia.org/T330768 (10JMeybohm)
[17:26:05] 10serviceops, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10lbowmaker)
[22:51:00] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Dzahn) After the switch of the apt servers we are getting alerting about bad systemd status on apt1001. ` <+icinga-wm> PROBLEM - Check systemd state...
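As a rough illustration of the Puppet DB idea mentioned at 15:10-15:11 (not the actual Gerrit changes): a dsh/scap target list can be derived from a PuppetDB query rather than a hand-maintained hieradata array. The sketch below uses the standard PuppetDB v4 resources endpoint to list the certnames of hosts that include a given role class; the PuppetDB hostname and the role class name are placeholders, not the real production values.

    # List every node that includes the (hypothetical) role class, then emit a
    # sorted, de-duplicated host list suitable for use as a dsh target file.
    curl -s -G http://puppetdb.example.wmnet:8080/pdb/query/v4/resources \
      --data-urlencode 'query=["and", ["=", "type", "Class"], ["=", "title", "Role::Mediawiki::Jobrunner"]]' \
      | jq -r '.[].certname' | sort -u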