[08:59:02] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Scap, 10Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10hashar) https://github.com/helm/helm/issues/11083#issuecomment-1221635340 claims it can be worked around by setting `KUBECONFIG=":path/to/insecure"`. Might be worth...
[09:00:40] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Scap, 10Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10hashar)
[09:11:14] <jayme>	 Hello, hello. So apart from akosiaris note about ferm rules there don't seem to be any other notes. I'll start depooling traffic from codfw now
[09:12:12] <claime>	 jayme: ack
[09:12:16] <claime>	 Godspeed
[09:12:21] <jayme>	 <3
[09:24:46] <claime>	 Cookbook working alright?
[09:26:36] <wikibugs>	 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert dry-run looks good, resolving
[09:26:50] <wikibugs>	 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[09:27:43] <jayme>	 claime: yeah. I needed to double check in code if "skip" and "move" really are the trigger words for restbase, but apart from that it looks smooth
[09:27:58] <claime>	 jayme: ok, will make it more explicit
[09:28:48] <jayme>	 claime: actually I think "ask_input" should probably do it in a generic way
[09:29:05] <claime>	 Fair enough
[09:32:13] <jayme>	 oh and --reason is not supported
[09:32:25] <jayme>	 or --task
[09:32:37] <claime>	 ack
[09:33:07] <claime>	 --reason flag is present but it probably isn't passed through
[09:33:18] <jayme>	 but it completed successfully - that's the main point I suppose :)
[09:34:02] <jayme>	 hmm...I think I ran with --reason and it complained
[09:34:30] <claime>	 I meant the argument exists in the code, not saying it works lol
[09:34:37] <claime>	 I haven't checked that part
[09:35:03] <jayme>	 maybe just the order of arguments
[09:35:05] <jayme>	 jayme@cumin1001:~$ sudo cookbook sre.discovery.datacenter depool --reason T327991 codfw                                                                                                                                                                                                                                 
[09:35:06] <jayme>	 usage: cookbook depool [-h] [--all] [--fast-insecure FAST_INSECURE] {eqiad,codfw}                                                                                                                                                                                                                                       
[09:35:08] <jayme>	 cookbook depool: error: argument datacenter: invalid choice: 'T327991' (choose from 'eqiad', 'codfw')  
[09:35:19] <claime>	 Ah yes
[09:35:28] <jayme>	 that was copy/pasta from the task
[09:35:55] <jayme>	 then I ran  "sudo cookbook sre.discovery.datacenter depool -h" which does not show --reason as valid argument either :)
[09:36:02] <jayme>	 then I gave up :D
[09:36:35] <claime>	 cgoubert@cumin1001:~/cookbooks$ sudo cookbook -d sre.discovery.datacenter --reason blah depool codfw                                                                                                                                                 
[09:36:37] <claime>	 DRY-RUN: Executing cookbook sre.discovery.datacenter with args: ['--reason', 'blah', 'depool', 'codfw']
[09:36:43] <claime>	 Yeah it wants it before the action
[09:36:51] <jayme>	 makes absolute sense :)
[09:38:09] <claime>	 It's because the actions are a subparser
[09:38:24] <jayme>	 yeah, guessed so
[09:38:53] <claime>	 I'll move it since we don't need a reason or task to run the other action which is NOT in a subparser, status
[09:39:13] <jayme>	 +1
[09:40:07] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10JMeybohm) Global depool of a/a services from codfw is done.
[09:40:28] <jayme>	 dcausse: You mind doing your thing to flink now? :)
[09:40:43] <dcausse>	 jayme: sure!
[09:41:23] <jayme>	 dcausse: oh wait...wdqs is not depooled in codfw
[09:41:30] <dcausse>	 ok waiting :)
[09:42:06] <jayme>	 ah, it's excluded for capacity reasons
[09:42:19] <jayme>	 https://phabricator.wikimedia.org/T329193
[09:42:34] <dcausse>	 hm...
[09:43:08] <dcausse>	 gehel: what do you think? should we serve stale data or depool and risk overloading eqiad?
[09:43:08] <jayme>	 but that's probably just for the switchover...I fuess
[09:43:11] <jayme>	 claime: 
[09:43:17] <claime>	 yes?
[09:43:31] <claime>	 Ah yeah
[09:43:33] <jayme>	 what about wdqs being excluded from depooling?
[09:43:53] <jayme>	 (sorry for the unpolite ping - I hit enter to fast :))
[09:43:53] <claime>	 Yeah I guess that doesn't work too well with what you want to do
[09:43:55] <dcausse>	 jayme: I'd be tempted to depool codfw to avoid serving stale data
[09:44:14] <gehel>	 we can always try depooling and see how it goes. Worst case, we'll repool.
[09:44:17] <claime>	 Just use the service-route cookbook
[09:44:40] <gehel>	 Why would we have stale data? Is the eqiad updater down during the switch?
[09:45:13] <jayme>	 ack. I was intersted in the reason for it being excluded and if we risk overloading eqiad if we run for a couple of hours from there only
[09:45:14] <claime>	 You won't have stale data, you will have no data from codfw, jayme is upgrading the k8s cluster
[09:45:32] <claime>	 It does run on the wikikube cluster right?
[09:45:43] <claime>	 (sorry I'm not completely awake yet)
[09:45:48] <jayme>	 claime: the updater does (rdf-streaming-updater)
[09:45:56] <claime>	 Ah right, so stale data
[09:45:58] <jayme>	 wqds itself does not. So stale data is right
[09:45:58] <claime>	 :p
[09:46:03] <gehel>	 no updates from codfw means stale data
[09:46:33] <gehel>	 if it is for a short time, we can keep an eye on things and hope that we don't overload
[09:47:05] <jayme>	 gehel: couple of hours really
[09:47:27] <gehel>	 should be fine! worst case, we'll repool and bite the stale data
[09:47:36] <dcausse>	 +1
[09:48:39] <jayme>	 ack. depooling wdqs and wdqs-ssl from codfw
[09:48:47] <dcausse>	 kk
[09:49:58] <jayme>	 there is already an alert firing for rdf in codfw
[09:50:02] <jayme>	 that's unexpected
[09:50:43] <dcausse>	 it recovered
[09:50:48] <jayme>	 indeed :)
[09:50:58] <dcausse>	 not sure what happened
[09:51:10] <dcausse>	 there's a hole in the graph here https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&var-site=codfw&var-k8sds=codfw%20prometheus%2Fk8s&var-opsds=codfw%20prometheus%2Fops&var-cluster_name=wdqs
[09:51:42] <dcausse>	 seems like it stopped consuming from kafka
[09:54:38] <dcausse>	 jayme: did you depool thanos or something like that, this would explain?
[09:54:47] <dcausse>	 anyways it's running fine now
[09:55:03] <jayme>	 dcausse: yes. thanos I've depooled
[09:55:37] <dcausse>	 ok it switched to writing thanos@eqiad then and that caused a restart most probably
[09:55:47] <jayme>	 ack
[09:57:38] <jayme>	 claime: FTR depooling of wdqs-ssl failed via service-route, there are no confctl entries for it
[09:57:48] <claime>	 ><
[09:58:24] <jayme>	 it's dnsdisc entry is wdqs from what I seen in service.yaml
[09:59:02] <jayme>	 dcausse: wdqs is depooled from codfw now
[09:59:44] <dcausse>	 jayme: ok, stopping the updater
[10:05:38] <dcausse>	 jayme: all done, jobs are stopped in codfw
[10:12:12] <jayme>	 dcausse: ok, cool. So you've got everything you need and I'm okay to whipe the cluster state?
[10:13:51] <dcausse>	 jayme: yes, you can go ahead
[10:14:51] <jayme>	 great, thanks!
[10:15:52] <jayme>	 akosiaris: you around?
[10:23:47] <jayme>	 taking that as a "no" :) from kubernetes overview as well as envoy telemetry dashboards I see amost no traffic to k8s services. A couple of req/s still but I'd assume these are monitoring checks
[10:33:19] <claime>	 jayme: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/890782 ;)
[10:33:42] <jayme>	 node_ipvs_backend_connections_active{site="codfw"} still shows a couple of connections to mobileapps and sessionstore - looks like only health checks still
[10:34:36] <jayme>	 if there are no objections (or other/better ways to check traffic) I would continue and wipe the cluster at 10:40Z
[10:41:28] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[10:43:53] <wikibugs>	 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp1003.eqiad.wmnet with OS bullseye
[10:46:48] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4830df3c-5faf-4828-b18a-c26399acca43) set by jayme@cumin1001 for 1 day, 0:00:00 on 23 host(s)...
[10:51:13] <akosiaris>	 jayme: I am around ;-)
[10:51:27] <akosiaris>	 how's the wiping going ? 
[10:51:31] <jayme>	 akosiaris: ok, last chance to stop me from cutting traffic :)
[10:51:51] <jayme>	 not sure if you hade some other checks in mind (regarding your comment on gerrit)
[10:53:44] <jayme>	 akosiaris: lmk if you want do double check on something 
[10:54:41] <akosiaris>	 niah, go ahead
[10:54:54] <jayme>	 ack
[10:59:49] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by root@cumin1001 for host kubetcd2004.codfw.wmnet with OS bullseye
[10:59:59] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubetcd2005.codfw.wmnet with OS bullseye
[11:00:11] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubetcd2006.codfw.wmnet with OS bullseye
[11:10:37] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[11:16:44] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[11:17:01] <wikibugs>	 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp1003.eqiad.wmnet with OS bullseye completed: - mc-gp1003 (**PASS**)   - Downtimed on Icinga/Alertmanager...
[11:18:15] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[11:26:35] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubetcd2006.codfw.wmnet with OS bullseye executed with errors: -...
[11:27:49] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubetcd2005.codfw.wmnet with OS bullseye completed: - kubetcd2005...
[11:32:16] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by root@cumin1001 for host kubetcd2004.codfw.wmnet with OS bullseye completed: - kubetcd2004...
[11:33:39] <jayme>	 elukey: etcd is in happy condition - no TLS errors.
[11:33:56] <elukey>	 \o/
[11:34:08] <akosiaris>	 👍
[11:34:34] <elukey>	 did you check if they have the new SAN?
[11:34:37] <elukey>	 just to be sure
[11:36:07] <jayme>	 yeah DNS:kubetcd2005.codfw.wmnet, DNS:k8s3.codfw.wmnet, DNS:_etcd-server-ssl._tcp.k8s3.codfw.wmnet
[11:37:00] <elukey>	 super
[11:37:04] <jayme>	 upgrade cookbook is reimaging control-planes now
[11:37:30] <jayme>	 (and kubernetes2009 afterwards)
[11:39:58] <elukey>	 super
[11:40:08] <elukey>	 going to lunch, 3 nodes to reimage will take a bit
[11:41:11] <jayme>	 ack
[11:58:58] <akosiaris>	 jayme: I 'll deploy this one https://gerrit.wikimedia.org/r/c/operations/homer/public/+/890802 
[11:59:08] <akosiaris>	 so that the routers will accept the advertisements from the nodes
[11:59:25] <jayme>	 akosiaris: oh, yeah. Thanks!
[12:00:09] <jayme>	 akosiaris: does it make sense to add the new eqiad space right away as well?
[12:01:07] <akosiaris>	 I 'll craft the changes for sure, maybe just not deploy them right now
[12:40:07] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2005.codfw.wmnet with OS bullseye
[12:40:10] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2006.codfw.wmnet with OS bullseye
[12:40:19] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2016.codfw.wmnet with OS bullseye
[12:40:21] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2015.codfw.wmnet with OS bullseye
[12:43:29] <elukey>	 jayme: o/ do you want me to do some reimages?
[12:43:45] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2010.codfw.wmnet with OS bullseye
[12:43:50] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye
[12:44:00] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2023.codfw.wmnet with OS bullseye
[12:44:29] <jayme>	 elukey: sure, happy to not have to do all of them
[12:44:46] <elukey>	 any range that you prefer?
[12:45:20] <jayme>	 I did start with all ganeti nodes and the metal ones affected by switch maintenance (as listed in https://phabricator.wikimedia.org/T329664)
[12:45:35] <jayme>	 pick anything you like really...just maybe add it to the task :)
[12:46:07] <elukey>	 2007->2012?
[12:46:26] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[12:46:40] <elukey>	 anything special to do? Just running the cookbook reimage right? 
[12:46:41] <jayme>	 elukey: excluding 2009 and 2010
[12:47:18] <elukey>	 ah right
[12:47:26] <jayme>	 nothing special, just run reimage cookbook 
[12:47:36] <elukey>	 200[7,8] 20[11,12]
[12:47:49] <jayme>	 ack, adding to the task
[12:48:39] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[12:49:15] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2007.codfw.wmnet with OS bullseye
[12:50:52] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2008.codfw.wmnet with OS bullseye
[12:51:02] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[12:51:22] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2013.codfw.wmnet with OS bullseye
[12:53:27] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2011.codfw.wmnet with OS bullseye
[12:54:31] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2012.codfw.wmnet with OS bullseye
[12:54:32] <elukey>	 all 4 reimages started
[12:55:43] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2014.codfw.wmnet with OS bullseye
[12:57:40] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye
[12:58:07] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2024.codfw.wmnet with OS bullseye
[12:59:49] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[12:59:56] <jayme>	 ack. All reimages started, 2017-2021 failing with "Unable to establish IPMI v2 / RMCP+ session"
[13:01:27] * claime lunch
[13:07:22] <jayme>	 hmm 2017-2021 we're probably not able to reach https://phabricator.wikimedia.org/T330048
[13:08:40] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2005.codfw.wmnet with OS bullseye completed: - kubernet...
[13:08:59] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[13:12:17] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2016.codfw.wmnet with OS bullseye completed: - kubernet...
[13:14:54] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2009.codfw.wmnet with OS bullseye
[13:15:07] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2006.codfw.wmnet with OS bullseye completed: - kubernet...
[13:16:49] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2015.codfw.wmnet with OS bullseye completed: - kubernet...
[13:17:11] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2023.codfw.wmnet with OS bullseye executed with errors:...
[13:21:39] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye completed: - kubernete...
[13:25:30] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2010.codfw.wmnet with OS bullseye completed: - kubernete...
[13:25:59] <jayme>	 akosiaris: for when you're back: We can't reach kubernetes2017-2021 via IPMI (probably because of T330048). That means we're down 5 nodes. We have 2023 and 2024 still in insetup which still means we're lacking 3 nodes
[13:26:41] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2007.codfw.wmnet with OS bullseye executed with errors:...
[13:26:44] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2024.codfw.wmnet with OS bullseye executed with errors:...
[13:26:45] <jayme>	 we could ofc. just run puppet on 2017-2021 to upgrade to k8s 1.23 which in theory should work (I did that in pontoon as well)
[13:27:12] <jayme>	 ...yeah - let me post that on the task :)
[13:28:37] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2013.codfw.wmnet with OS bullseye executed with errors:...
[13:28:52] <elukey>	 2007 up and running
[13:32:01] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2012.codfw.wmnet with OS bullseye completed: - kubernet...
[13:33:04] <godog>	 jayme elukey what's the best place/way to followup wrt alerts and silences in the k8s upgrade cookbook? e.g. which taks ?
[13:33:19] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2008.codfw.wmnet with OS bullseye executed with errors:...
[13:33:23] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[13:34:42] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye executed with errors:...
[13:37:06] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2011.codfw.wmnet with OS bullseye completed: - kubernet...
[13:37:40] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) In the meantime we have created two cookbook:  * sre.k8s.upgrade-cluster.py * sre.k8s.wipe-cluster.py
[13:37:45] <elukey>	 godog: probably https://phabricator.wikimedia.org/T277677
[13:37:52] <elukey>	 2008 is up!
[13:38:39] <jayme>	 elukey: there are enough nodes for basic operation now. I'll start deploying admin_ng stuff
[13:39:05] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2014.codfw.wmnet with OS bullseye completed: - kubernete...
[13:39:25] <godog>	 thank you elukey 
[13:41:33] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215...
[13:44:07] <elukey>	 jayme: 2011 and 2012 up as well
[13:45:07] <jayme>	 elukey: nice
[13:45:22] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10fgiunchedi) Following up for silences, especially the ones paging in production (`ProbeDown`).  * ProbeDown: the most...
[13:45:30] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10elukey)
[13:45:43] <elukey>	 jayme: if you want to offload more to me lemme know
[13:46:45] <jayme>	 elukey: the only one left is 2009 which failed on first try plus the ones we cant get to :/
[13:47:35] <jayme>	 elukey: if you're up to it you could double check that all reimages acutally worked (I did get a lot of failures) and puppet has successfully run everythere
[13:48:13] <elukey>	 sure
[13:48:14] <jayme>	 elukey: apart from 2017-2021,2023-2024 obviously
[13:48:26] <elukey>	 I got failures too, but mostly downtime-related
[13:48:34] <jayme>	 oh and 2009 :) - which is still ongoing
[13:48:44] <jayme>	 yeah, mine as well
[13:49:08] <jayme>	 or not being able to run puppet for ssh key update IIUC
[13:51:58] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2009.codfw.wmnet with OS bullseye completed: - kubernete...
[13:52:16] <jayme>	 elukey: 2009 done as well
[13:52:44] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[13:54:08] <elukey>	 jayme: checked up to 2016, all good afaics
[13:55:26] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Vgutierrez)
[13:57:39] <jayme>	 elukey: cool, thanks
[14:01:08] <jayme>	 elukey: akosiaris: admin_ng deployed without issues
[14:01:52] <elukey>	 woww
[14:05:38] <jayme>	 elukey: do you have time to check if we can easily add 2023 and 2024 to the cluster?
[14:06:26] <jayme>	 I would start deploying services back now...maybe, if we don't deploy things like thumbor, we have enough capacity so run without 2017-2021 for a bit
[14:06:37] <elukey>	 ok lemme check
[14:06:44] <jayme>	 hoping the mgmnt access is restored before switchover 
[14:06:51] <jayme>	 elukey: ❤️
[14:13:12] <elukey>	 I am following https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes
[14:13:31] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/890824
[14:14:19] <jayme>	 elukey: is it fine that the selector would include kubernetes102[34] as well?
[14:14:37] <elukey>	 ah no snap
[14:14:56] <elukey>	 good point
[14:16:02] <elukey>	 ok opted for a quick fix, lemme know if it is ok
[14:16:09] <elukey>	 in the meantime, I am doing homer's patch
[14:17:22] <jayme>	 elukey: I suppose those nodes are in row-e|f - I'm not sure we are prepared for that in wikikube (bgppeer selector wise)
[14:18:40] <elukey>	 jayme: nope B and D
[14:18:47] <elukey>	 we should be good
[14:18:49] <jayme>	 oh, great :D
[14:18:59] <elukey>	 (do codfw have e/f?)
[14:19:22] <claime>	 B is going down about now
[14:26:45] <elukey>	 ok so
[14:26:50] <elukey>	 homer change: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/890834/
[14:27:15] <elukey>	 and the three puppet ones https://gerrit.wikimedia.org/r/c/operations/puppet/+/890824
[14:28:04] <elukey>	 jayme: --^
[14:28:16] <elukey>	 the order should be the one indicated in the wiki
[14:28:20] <claime>	 jayme: fyi, merging the logging improvements / task id / reason improvements to sre.discovery.datacenter
[14:36:43] <jayme>	 elukey: looks correct to me
[14:37:01] <elukey>	 jayme: should I proceed, since the row-b maintenance is done?
[14:37:18] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ayounsi) Upgrade went smoothly, less than 15min hard downtime here too.
[14:37:29] <jayme>	 it is already...ah lol
[14:37:35] <claime>	 yeah :D
[14:37:45] <jayme>	 elukey: yeah, go ahead then please
[14:37:56] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[14:45:18] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[14:45:53] <jayme>	 echostore kask is not coming back-up (and not logging anything) - I would bet some networking issue because of the new ip range
[14:51:26] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[14:52:55] <jayme>	 looks like I've missed yet another list of ip ranges https://gerrit.wikimedia.org/r/c/operations/puppet/+/890838
[14:53:58] <jayme>	 not sure why that should make echostores health check fail but anyhow...
[14:54:08] <elukey>	 +1ed
[14:54:19] <jayme>	 thanks
[14:54:25] <elukey>	 well worth to check after a puppet run :)
[14:55:21] <elukey>	 in the meantime, I am running homer
[14:59:45] <elukey>	 ok I see the nodes in Ready state
[14:59:51] <jayme>	 \o/
[15:00:26] <elukey>	 but the BGP session with cr1-eqiad is still in "Active"
[15:01:22] <elukey>	 yeah calico pods are not up
[15:03:21] <elukey>	 it says
[15:03:22] <elukey>	 bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
[15:03:42] <elukey>	 mmm do I need to merge the other two puppet changes?
[15:03:44] <elukey>	 jayme: --^
[15:04:28] <akosiaris>	 what's up with 2017-2021  ?
[15:04:30] <akosiaris>	 may I help ? 
[15:04:55] <elukey>	 ah no
[15:04:56] <elukey>	  felix/sync_client.go 158: error connecting to typha endpoint (2 of 3) 10.192.32.109:5473 connID=0x0 error=dial tcp 10.192.32.109:5473: i/o timeout type=""
[15:05:01] <claime>	 akosiaris: management interface broken
[15:05:13] <jayme>	 elukey: I think the bgppeers onw might be needed
[15:05:34] <jayme>	 elukey: in case you wonder: I've cordoned 23 and 24
[15:05:39] <claime>	 So they're putting 2023/2024 in production
[15:05:39] <elukey>	 super
[15:06:00] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jcrespo) I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.
[15:06:44] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) p:05Triage→03Medium
[15:07:21] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi)
[15:09:19] <elukey>	 jayme: yep all running now
[15:09:38] <jayme>	 akosiaris: we're unable to reimage 2017-2021 - unable to access management interface. That's why elukey is putting 2023 and 2024 in production. We will still be down 3 nodes, though
[15:09:44] <jayme>	 elukey: cool!
[15:10:03] <akosiaris>	 down 3 nodes... ouch
[15:10:11] <elukey>	 and the routers are happy as well (BGP sessions in Established)
[15:10:14] <akosiaris>	 some things aren't going to be deployable I fear
[15:10:30] <jayme>	 akosiaris: I'm thinking about deploying thumbor with less replicas or not at all
[15:10:35] <akosiaris>	 or are going to be barely deployable
[15:10:44] <akosiaris>	 jayme: yeah, go for less replicas
[15:10:52] <akosiaris>	 not sure if we can do not at all
[15:10:54] <akosiaris>	 can we? 
[15:10:58] <jayme>	 I'm not sure if it get's traffic at all currently
[15:11:06] <jayme>	 hnowlan: ?
[15:11:07] <akosiaris>	 I guess we can just remove it from confctl overall
[15:11:16] <akosiaris>	 I think it's paused on some swift errors 
[15:11:17] <elukey>	 jayme: nodes added to kubesvc too (didn't check if they are pooled or not though)
[15:11:30] <elukey>	 so 2023/2024 should be ready to go
[15:11:34] <elukey>	 (need to join a meeting)
[15:11:35] <jayme>	 elukey: ack. uncordoned
[15:11:41] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jcrespo)
[15:11:44] <jayme>	 elukey: thank you so much!
[15:11:55] <elukey>	 <3
[15:13:20] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MoritzMuehlenhoff)
[15:13:23] <jayme>	 akosiaris: hnowlan: I've (manually) scaled thumbor down to 1 replica in codfw
[15:14:53] <jayme>	 echostore now coming up as well with https://gerrit.wikimedia.org/r/c/operations/puppet/+/890838 merged
[15:16:53] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE Observability (FY2022/2023-Q3): Index orchestrator object fields from ECS 1.11.0 in OpenSearch - https://phabricator.wikimedia.org/T328318 (10lmata)
[15:17:03] <jayme>	 akosiaris: do you have time to read backlog and at some point take this over from me? I'm already a bit over the appropriate time window for sick + working :-|
[15:19:44] <jayme>	 also scaled mw-api-ext/mediawiki-main from 4 to 2 replicas to have it sheduled
[15:21:34] <akosiaris>	 jayme: yeah, go ahead
[15:22:00] <jayme>	 scaled mw-debug/mediawiki-pinkunicorn from 2 to 1 replicas
[15:22:01] <claime>	 jayme: you can scale down anything mw related, there's only test2wiki
[15:22:55] <jayme>	 in that case: scaled mw-web/mediawiki-main from 8 to 4 replicas
[15:23:04] <wikibugs>	 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris)
[15:23:14] <claime>	 ack
[15:23:26] <wikibugs>	 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris)
[15:25:13] <jayme>	 akosiaris: all services deployed to codfw
[15:25:34] <wikibugs>	 10serviceops, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3): Port mediawiki prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312764 (10lmata)
[15:25:35] <jayme>	 datahub and eventstreams-internal have (unexpected) new versions
[15:25:50] <claime>	 Amir.1 is taking the opportunity of mw codfw being depooled to run some drift correction
[15:26:05] <claime>	 ETA 1h
[15:26:06] <jayme>	 dcausse: you may start rdf-streaming-updater now
[15:26:46] <dcausse>	 inflatador: ^
[15:27:16] <akosiaris>	 jayme: I am updating https://phabricator.wikimedia.org/T329664, I think the only too left out are  Reimage nodes: 2017-2021 ("Unable to establish IPMI v2 / RMCP+ session") and Lift downtimes, right ? 
[15:27:34] <akosiaris>	 and then start looking into what's the situation overall with workloads being deployed and so on
[15:27:35] <inflatador>	 dcausse jayme ACK
[15:27:39] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[15:28:15] <akosiaris>	 jayme: scale downs were manual ? that's is kubectl edit? or via helmfile ? 
[15:28:47] <akosiaris>	 kubectl edit is my guess ;-)
[15:28:48] <jayme>	 akosiaris: downtimes I will lift in a minute because I've the upgrade cookbook sitting in a tmux
[15:28:56] <jayme>	 akosiaris: kubectl scale, yes
[15:29:02] <akosiaris>	 cool, ok, thanks
[15:29:18] <claime>	 Which means next scap will rescale up mw services
[15:29:21] <jayme>	 so. what's left is checking things, and probably repool services
[15:29:37] <jayme>	 oh..and fix what claime just said :)
[15:29:45] <claime>	 jayme: hold on repool, see what I said about Amir and drift correction
[15:30:00] <jayme>	 yesyes, sure
[15:30:14] <jayme>	 but it's still something that needs to be done :)
[15:30:59] <claime>	 yes, was just making sure it wasn't lost in the noise
[15:32:49] <hnowlan>	 jayme: ack, thanks for that - thumbor isn't pooled at the moment so it's no problem 
[15:36:06] <claime>	 amir's done, we can repool whenever
[15:36:11] <claime>	 jayme: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/890842
[15:36:22] <claime>	 Final touches to add -r and -t to the cookbook
[15:36:53] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[15:37:09] <jayme>	 inflatador: dcausse: please tell, update https://phabricator.wikimedia.org/T329664 when you're done
[15:37:44] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[15:38:03] <dcausse>	 jayme: sure, we'll do
[15:40:40] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[15:41:32] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[15:41:36] <jayme>	 akosiaris: I think I've added everything that's left as todo to the task
[15:41:49] <claime>	 I broke the cookbook. Sorry.
[15:41:53] <claime>	 Fixing asap
[15:42:43] <akosiaris>	 jayme: thanks!
[15:43:01] <inflatador>	 jayme we are done redeploying rdf-streaming-updater in codfw
[15:43:14] <inflatador>	 updating phab task as well
[15:43:17] <akosiaris>	 inflatador: cool, thanks!
[15:43:23] <akosiaris>	 I guess I can repool wdqs
[15:43:46] <dcausse>	 akosiaris: not yet we need to wait a bit for the lag to catch up on wdqs nodes
[15:43:54] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10bking)
[15:43:59] <akosiaris>	 ok, let us know when then. 
[15:44:17] <dcausse>	 sure
[15:45:42] <jayme>	 akosiaris: I'll step away now. Won't resurface for the meeting and maybe I guess I won't be around tomorrow as well (was maybe a bit much today.. :/ ). But if there's anything I left unclear, please feel free to send a msg via signal - I'll respond if I'm awake :)
[15:46:03] <inflatador>	 akosiaris jayme based on lag stats at https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&var-site=codfw&var-k8sds=codfw%20prometheus%2Fk8s&var-opsds=codfw%20prometheus%2Fops&var-cluster_name=wdqs&from=now-15m&to=now , we should be able to depool in ~30m
[15:46:04] <akosiaris>	 ok, we got it from there. Go rest and many thanks! 
[15:46:11] <inflatador>	 err...repool that is
[15:46:22] <jayme>	 🤗
[15:46:30] <inflatador>	 +1, thanks for your help!
[15:46:34] <akosiaris>	 ok, that coincides with our meeting, we 'll give it another 20-25 mins extra in that case
[15:47:13] <inflatador>	 It
[15:47:25] <claime>	 ok, so the cookbook works for pooling/depooling, status command is borked
[15:47:34] <claime>	 so at least the functionality is ok
[15:47:43] <inflatador>	 it's just a matter of using the confctl to repool CODFW into discovery? or is there more? If it's just confctl I can do that myself
[15:48:58] <akosiaris>	 inflatador: it's just that indeed
[15:49:09] <akosiaris>	 wdqs and wdqs-internal 
[15:49:31] <gehel>	 and wcqs as well (I think)
[15:49:40] <inflatador>	 akosiaris ACK, then I will take care of those once they are caught up
[15:49:50] <akosiaris>	 gehel: good point, wcqs too
[15:50:34] <wikibugs>	 10serviceops, 10Maps, 10Observability-Metrics, 10SRE, and 3 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10lmata)
[15:51:11] <claime>	 akosiaris: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/890844/ < fix for my borkage
[15:51:12] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[15:53:51] <volans>	 claime: required=False and default=None are the default for add_argument
[15:54:06] <claime>	 volans: explicit is better no?
[15:54:27] <volans>	 then why all the others  don't have default ? :D
[15:54:40] <claime>	 volans: because I'm inconsistent (and honest :P)
[15:54:43] <volans>	 ahahaha
[15:55:13] <volans>	 explicit is better when it's counter intuitive or not too verbose, I find adding too many defaults makes it less readable, but that's me
[15:55:16] <akosiaris>	 jerkins-bot -1ed you :P
[15:55:16] <claime>	 I'll remove it, I was actually trying to figure out an args namespace error that has nothing to do with this part
[15:55:23] <claime>	 akosiaris: line too long'd
[15:56:18] <claime>	 Well not nothing, but that wouldn't fix it
[15:58:43] <claime>	 At least it only breaks status, which I may be the only one using for now :D
[16:03:59] <claime>	 volans: should be better now :)
[16:13:54] <wikibugs>	 10serviceops, 10SRE: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) PHP build depends on libxml2, which itself also uses ICU by default. I have patched it to build without ICU for the component/icu67 component, it falls back to iconv internally.
[16:45:53] <akosiaris>	 out of meeting, I am gonna repool kubernetes codfw 
[16:46:30] <claime>	 ack
[16:47:16] <wikibugs>	 10serviceops, 10SRE: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10RLazarus) We decided we'll put these into service after the upcoming DC switchover, so we'll make a plan at the March 6 serviceops meeting.
[16:49:55] <inflatador>	 backlog looks pretty good, will repool wdqs/wdqs-internal/wcqs
[16:53:51] <claime>	 akosiaris: did you provide a reason on your cli call to the cookbook ? it shouldn't be None :( one more bug to squash
[16:54:12] <akosiaris>	 I thought I did
[16:54:16] <akosiaris>	 sre.discovery.datacenter pool --reason T327991 codfw
[16:54:20] <akosiaris>	 aaaah
[16:54:21] <akosiaris>	 dammit
[16:54:24] <claime>	 Ah lol
[16:54:26] <akosiaris>	 and I even reviewed
[16:54:44] <claime>	 I was *very* confused for a second
[16:54:45] * akosiaris need sleeeeeep
[16:55:17] <akosiaris>	 we are halfway done btw, mw-api-ext-ro now
[16:55:45] <claime>	 Yeah I'm checking my pretty status output :P
[16:57:45] <akosiaris>	 yeah, it's useful
[16:58:03] <akosiaris>	 it does log on sal however
[16:58:07] <claime>	 -d
[16:58:32] <claime>	 It was a bit too much of logger trickery to make it not log to sal, I decided not to
[16:59:08] <claime>	 I have a watch sudo cookbook -d sre.discovery.datacenter status all
[16:59:11] <claime>	 running
[17:00:21] <inflatador>	 wdqs/wdqs-internal/wcqs CODFW are now repooled
[17:00:30] <claime>	 thanks inflatador 
[17:00:52] <inflatador>	 no worries. They pay me for this ;P
[17:01:01] <claime>	 Wait WHAT?
[17:01:06] <claime>	 :[
[17:07:15] * inflatador fans out his extensive collection of $2 bills
[17:08:47] <elukey>	 akosiaris, claime - I just set pooled=yes/weight=10 for kubernetes202[3,4]
[17:08:59] <elukey>	 (for kubesvc)
[17:10:45] <akosiaris>	 thanks
[17:11:11] <akosiaris>	 inflatador: I just got the $2 bill reference. Lol 
[17:12:38] <claime>	 akosiaris: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890882/ < scale reduction for codfw mw-*
[17:14:00] * claime waits patiently for jenkins
[17:28:27] <claime>	 akosiaris: Do you know how to force helmfile to stop wth it's doing and do what I'm telling it to?
[17:28:30] <claime>	   Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[17:29:43] <akosiaris>	 claime: you can destroy the release
[17:29:45] <akosiaris>	 but that's about it
[17:29:54] <claime>	 a'ight then
[17:29:57] <claime>	 let's kill em
[17:30:04] <akosiaris>	 so, something like helmfile destroy -l name=blah destroy
[17:30:20] <akosiaris>	 I am assuming you have only 1 release, but if it is all, you can skip the -l name=blah ofc
[17:30:39] <claime>	 Nah, it's mw-api-ext and stuff, but it's all borked anyways
[17:31:49] <claime>	 There we go
[17:33:42] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10akosiaris)
[17:38:59] <claime>	 ok mw-on-k8s is ok now
[17:39:05] <claime>	 think we can repool DNS ?
[17:40:17] <claime>	 akosiaris: ^
[17:41:08] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10Clement_Goubert)
[17:41:25] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10Clement_Goubert) Scale down persisted. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890882
[17:50:48] <akosiaris>	 claime: yes, please do
[17:50:58] <claime>	 repool thumbor too?
[17:51:21] <claime>	 ah no, it's inactive everywhere
[17:53:11] <akosiaris>	 No, repooling would cause issues right now
[17:53:20] <claime>	 ok
[18:08:00] <rzl>	 mutante: hey, I'm looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/527912 you sent me, and it looks like the underlying ticket is stalled because there's no community consensus -- did something change? same for at least some of the others, I haven't checked them all yet
[18:21:03] <mutante>	 rzl: well, what changed is that in general for years all those old tickets about renaming wikis/wiki aliases had been blocked and now they are not blocked anymore and I was basically saying "this is a valid language code and I added it to DNS". but that said, if this particular language "cbk" does not have consensus and that comment from 2016 is still valid.. then I take back the review 
[18:21:09] <mutante>	 request and would remove you
[18:21:35] <mutante>	 the other languages did not seem to have this comment that cbk does, let's remove that one then
[18:22:11] <mutante>	 removed you in gerrit
[18:26:08] <rzl>	 mutante: ah okay, I saw the others were stalled but I didn't realize the reason was different
[18:26:48] <rzl>	 several have comments like https://phabricator.wikimedia.org/T17988#3223979 saying they're blocked on tasks that are still open -- but it sounds like they're not blocked anymore, you're saying?
[18:27:52] <mutante>	 yea, so I heard that there is no more block on "aliasing" wikis
[18:28:07] <mutante>	 and based on that I merged a bunch of language names in DNS
[18:28:20] <mutante>	 after confirming they were valid language codes, ISO 839-3 
[18:28:50] <mutante>	 and then I saw the matching httpd config patches for them.. uploaded by the volunteer a long time ago
[18:29:02] <mutante>	 so that's why they showed up in review now
[18:29:35] <mutante>	 it was only to unblock them. .so it's possible to create wikis or redirects
[18:29:42] <mutante>	 both would come after DNS 
[18:30:23] <rzl>	 right
[18:30:37] <rzl>	 do you have a pointer for that blocker being resolved? I just can't find anything about it in phab
[18:31:01] <rzl>	 I see that https://phabricator.wikimedia.org/T172035 is still open, for *renaming* domains, but that doesn't strictly apply to aliases
[18:32:14] <mutante>	 hrmmm, valid request..
[18:32:18] <mutante>	 looks
[19:03:34] <mutante>	 rzl: not sure I have a very good answer, but I think it's that the "full renaming" seems still blocked as always but "alias the correct language name to redirect to the old wrong language name" has never been blocked and is just like "if we cant get the rename then at least do that". while the DNS changes unblocked both.. I realize that merging the httpd changes makes a decision there, which 
[19:03:40] <mutante>	 I am not asking you to make. so I should probably just retract my review requests 
[19:04:23] <mutante>	 which probably means it's effectively blocked again though, heh
[19:05:12] <mutante>	 even the "just at least redirect the correct name" part that is
[19:07:29] <mutante>	 well, there is something though that explains it: https://phabricator.wikimedia.org/T25216#3340841
[19:07:43] <mutante>	 that "I only suggested creating a CNAME domain name alias as a helper for a transition period (that should not exceed one year). During that time we'll fix all the remaining items "
[19:08:27] <rzl>	 haha that was 2017 so I'm not sure what to make of it now :P
[19:10:16] <mutante>	 hmm, yea, I think this is what happens every round 
[19:10:43] <rzl>	 anyway I'm happy to merge and deploy those changes if you get a good answer from somebody, let me know -- I'm just being cautious because it sounds like there at least used to be a lot of ways it can go wrong, and merging an old patch now might take everybody by surprise
[19:10:49] <mutante>	 I will remove where I added you.
[19:11:25] <rzl>	 thanks for bringing me in though, I appreciate the work
[19:11:55] <mutante>	 I can't ask you to get that consensus
[20:04:22] <wikibugs>	 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10colewhite)
[21:46:17] <wikibugs>	 10serviceops, 10SRE, 10noc.wikimedia.org: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Aklapper)