[00:18:49] serviceops, Gerrit, SRE, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn)
[03:17:51] serviceops, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8122731, @Joe wrote: > Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to ma...
[03:47:16] serviceops, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8123041, @MatthewVernon wrote: > Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably writ...
[06:16:12] inflatador: ryankemper: you need to bump the chart version in Chart.yaml for that change to be picked up
[06:40:53] jayme: ah, that makes sense - staging must just not have had the latest version applied yet, which is why we saw a diff for that
[06:42:16] thanks, we'll bump the version and deploy it properly tomorrow
[07:24:47] <_joe_> ahoy serviceops people, we have some fresh racks to power down today
[07:25:40] <_joe_> I'll do rack D2
[07:25:44] <_joe_> :D
[07:27:41] <_joe_> jayme: see, I still have some manager in me ^^
[07:28:17] I never had any doubts
[07:30:22] I'm currently trying to figure out why we have such elevated list latency in k8s codfw since yesterday
[07:38:19] <_joe_> jayme: maybe moritz moved etcd to drbd for maintenance?
[07:38:32] currently only a staging etcd node
[07:38:41] the rest is all on plain disk storage
[07:38:50] <_joe_> ok then it must be something else
[07:38:59] <_joe_> jayme: did you try to restart the apiserver?
[07:43:19] no. I was thinking in the direction of another reflector death loop
[07:43:34] like https://phabricator.wikimedia.org/T303184
[07:43:59] but haven't found solid evidence for that so far.
[07:44:11] need to run a quick errand...back in ~15min
[08:00:34] <_joe_> hnowlan: some servers that should go into maintenance today are "yours", I am specifically worried about maps2008
[08:00:49] <_joe_> I guess it should be in the "turn off just before" category
[08:04:23] <_joe_> jelto, jayme: jokes aside, I have a quite busy day with at least one meeting I have to do homework for - I was supposed to do that yesterday but you know...
[08:05:05] <_joe_> so given the number of servers is smaller, I'd ask you to pick up the prep work this morning of turning off all the servers that don't need to stay on
[08:06:25] _joe_: yes, sure. I'm going to get to that in 30-45min
[08:06:46] thanks for completing D2 already :-p
[08:06:49] <_joe_> you have until the afternoon and it's like 10 servers, take it easy :)
[08:06:50] I can help turning off the servers which don't have the "just before" flag.
[08:07:09] <_joe_> jelto: yeah I was asking for that :)
[08:07:25] <_joe_> thanks folks, you're amazing <3
[08:08:08] jelto: can you schedule a downtime for all the not-"just before" nodes until 19.00Z or so?
[08:09:04] yes I'll do that
[08:10:19] elukey: we have to take kafka-main2003 down today (rack C7 maintenance ~16.00Z). Current plan is to shut it down right before the maintenance starts and bring it back right after. Anything else to consider?
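A minimal sketch of the kafka-main2003 plan discussed here (elukey confirms below that a clean broker stop plus shutdown is enough); the downtiming step and the verification command are assumptions about how this would typically be handled, not the exact procedure that was used:

    # schedule an Icinga downtime for kafka-main2003 covering the C7 window (tooling not specified in the log)
    # then, on kafka-main2003 itself, right before the maintenance starts:
    sudo systemctl stop kafka    # clean broker stop; expect some under-replicated partition alerts
    sudo poweroff                # power down for the PDU swap
    # after power-on, check that the broker has rejoined and partitions have caught up
    # (the exact CLI wrapper/path and flags depend on the Kafka version installed on the host)
    kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions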
[08:10:39] https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2 ~ line 101
[08:10:41] I've now also switched kubestagetcd2002 back to "plain" disk storage
[08:15:31] I'm starting with downtiming and powering off the first machines for today which are not "just before". Starting with mc2047-48 (not in production)
[08:19:24] jayme: o/ a clean stop of kafka + shutdown is enough, there will be some alarms about under-replicated partitions but not a big deal. Two nodes at the same time would be tolerable but more risky (if anything else fails)
[08:32:44] <_joe_> elukey: can you add instructions on how to cleanly shut down to https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2 ?
[08:33:10] <_joe_> I'm not sure the people who will be doing that are here right now, given the time of the day when this should happen
[08:36:28] ah yes, it is just systemctl stop kafka, nothing big
[08:36:48] it could be part of a clean shutdown, so maybe not needed
[08:48:48] interesting pattern regarding LIST latency...
[08:49:05] api request avg5m looks quite elevated (we alert on that one) https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&from=1659516500000&to=now&var-datasource=thanos&var-site=codfw&var-cluster=k8s
[08:49:45] but p99 latency of LIST calls has not increased: https://grafana-rw.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?from=1659516500000&to=now&var-datasource=thanos&var-site=codfw&var-cluster=k8s&orgId=1&var-verb=LIST&var-group=All&var-resource=secrets&forceLogin
[08:50:35] the situation immediately got better when jel.to drained kubernetes2022, as that re-scheduled helm-state-metrics: https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=1659516500000&to=now&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=helm-state-metrics-c5594cd87-bfz55&var-pod=helm-state-metrics-c5594cd87-hjx46&var-pod=helm-state-metrics-c5594cd87-hrskg
[08:51:37] I've seen increasing LIST times when helm-state-metrics was throttled (it is kind of spiky because it needs to unpack all the helm release secrets, unfortunately)
[08:52:17] so probably a noisy-neighbour scenario (as there is no throttling involved now - helm-state-metrics has no cpu limit)
[09:23:52] When shutting down mw* hosts in codfw today, they don't need to be depooled individually because the service/discovery is depooled for codfw atm, is this correct :)? (for example confctl --object-type discovery select 'dnsdisc=appserver.*' get)
[09:25:06] <_joe_> no
[09:25:17] <_joe_> :P
[09:25:37] <_joe_> they need to be set to inactive so that they don't appear in the scap sync list
[09:26:28] <_joe_> actually, I think we should add two systemd timers: one on startup that does a scap pull once the network is up
[09:26:38] <_joe_> and one that on shutdown runs "sudo decommission"
[09:26:51] _joe_: ah yes I remember. And this was done by the "decommission" alias? I was looking in SAL and could not find anything from yesterday
[09:26:57] <_joe_> yes
[09:30:17] ok thanks. I'll proceed with mw2271-79
[09:42:51] _joe_: maps should be fine, some restbase though
[09:43:01] <_joe_> yeah...
[09:43:17] they're in the same rack, which should be fine; it's more hassle to downtime/shut them down in advance than it's worth
[09:43:26] but I have an eye on it
[09:47:44] fyi: I need to go afk in around 1h for 1h. I'll be back before the pdu swap starts (so back at ~12:00Z)
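Picking up _joe_'s point above about marking the mw hosts inactive (rather than relying on the codfw discovery depool), a hedged sketch of what that could look like with confctl for a single host; the selector and the exact sequence are illustrative, not the commands that were actually run:

    # before power-off: set the host inactive so it drops out of the scap sync list
    sudo confctl select 'name=mw2271.codfw.wmnet' set/pooled=inactive
    # after power-on: refresh the MediaWiki tree on the host, then pool it again
    scap pull
    sudo confctl select 'name=mw2271.codfw.wmnet' set/pooled=yes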
[09:49:48] <_joe_> jelto: thanks for all the help
[10:03:44] Hello, I wonder if someone could clarify something for me please - about bootstrapping a new etcd cluster.
[10:04:49] What's the current state of things with regard to the tlsproxy part of it? Wikitech says:
[10:04:55] https://www.irccloud.com/pastebin/Ko3nccim/
[10:05:46] <_joe_> btullis: is that a cluster for kubernetes?
[10:05:52] ...but as far as I can see that parameter no longer exists.
[10:06:08] Yes, it's for a new kubernetes cluster
[10:06:48] Re: https://phabricator.wikimedia.org/T310196
[10:07:59] <_joe_> btullis: yeah, now we have profile::etcd::tlsproxy if you want to add the proxy
[10:08:07] <_joe_> but for k8s IIRC we do not do that
[10:08:32] <_joe_> it would be quite useless as access control is based on IP address rather than on RBAC
[10:08:58] <_joe_> the thing that is very slow in etcd is the authn/authz function *for the v2 api*
[10:09:52] <_joe_> btullis: I think you should basically use role::etcd::v3::kubernetes
[10:10:43] <_joe_> but you'll need to create a separate role I think, like elukey has done for the ML cluster, so that it's easier to manage hiera
[10:10:59] <_joe_> (his is role::etcd::v3::ml_etcd )
[10:11:34] <_joe_> btullis: which wikitech page were you reading?
[10:11:39] <_joe_> I should clarify it
[10:12:06] OK cool, thanks. Yes, that's what I've done: role::etcd::v3::dse_k8s_etcd (based on ml_etcd, as you mentioned)
[10:12:25] Sample code and instructions here: https://wikitech.wikimedia.org/wiki/Etcd#Bootstrapping_an_etcd_cluster
[10:13:09] I can try to update it, but I thought I'd like to clarify so I don't botch the instructions for someone else :-)
[10:13:23] <_joe_> ah sigh, yes, that's for a *main* etcd cluster - apologies, it's clearly not been updated since we got k8s etcds
[10:14:15] <_joe_> ok let me first add a note there at the top
[10:14:19] All good, thanks for the explanation.
[10:17:03] _joe_: Would you mind if I add you as a reviewer for the CRs to bootstrap it? They shouldn't affect anything outside of this cluster, but I haven't bootstrapped an etcd cluster before.
[10:17:33] the hosts that could be prepared for today are all shut down
[10:17:36] <_joe_> btullis: sure, I'm quite rusty on bootstrapping a cluster but I should be able to help
[10:17:41] <_joe_> jayme: <3
[10:17:54] <_joe_> we'll follow the maintenance in the afternoon
[10:26:41] jayme: thanks!
[14:02:23] serviceops, Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (JMcLeod_WMF)
[14:03:25] serviceops, Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (JMcLeod_WMF) p:Triage→Medium
[14:56:05] I'm going to power off mc2030 and mc2031
[14:59:56] thanks!
[15:02:11] sorry, I'm going to be unavailable to help with the maintenance today until about 20 UTC -- mutante, will you be able to handle the serviceops steps?
[15:09:35] <_joe_> jelto: ah you're here
[15:09:36] <_joe_> :D
[15:09:45] <_joe_> we should really all coordinate in #sre
[15:09:50] joe: I already powered down mc2030-31.
[15:09:55] okay
[15:12:52] I suggested -sre yesterday already - it would make it easier not having to check 3 channels
[15:13:05] <_joe_> jayme: let's all just communicate there
[15:13:08] <_joe_> it will be contagious
[15:13:13] <_joe_> hopefully
[15:14:20] <_joe_> hnowlan: we're now powering off C5
[15:14:40] <_joe_> should we turn off restbase2020 and 2025?
[15:15:10] yep, draining them now
[15:15:32] mentioned in #wikimedia-dcops
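Circling back to the etcd bootstrapping thread earlier in the log: once the new dse_k8s_etcd nodes are up, a quick sanity check with etcdctl (v3 API) could look like the sketch below; the endpoint hostname and CA path are illustrative placeholders, not the real values for that cluster:

    export ETCDCTL_API=3
    # hostname and --cacert path below are assumptions for illustration only
    etcdctl --endpoints=https://dse-k8s-etcd1001.eqiad.wmnet:2379 \
      --cacert=/path/to/cluster-ca.pem endpoint health
    etcdctl --endpoints=https://dse-k8s-etcd1001.eqiad.wmnet:2379 \
      --cacert=/path/to/cluster-ca.pem member list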
[16:07:05] jayme, _joe_: I'd like to leave in around 10 minutes. Does that work for you?
[16:07:19] <_joe_> jelto: go on
[16:07:26] yes
[16:07:38] I can assist from here
[16:07:42] thanks <3
[17:28:56] one more thing: "PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled" - looking at that next ;/
[17:29:16] it's a race around listening on the v6 IP or so..
[17:34:35] fixed with 'systemctl restart ssh-phab' on phab2001. RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK
[18:41:53] all mw servers in D3 scap pulled / pooled again / downtimes removed
[20:26:57] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) ` # host gitlab-replica-old.wikimedia.org gitlab-replica-old.wikimedia.org has address 208.80.154.15 gitlab-...
[20:29:29] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) @Jelto I managed to switch over the hosts, i.e. gitlab1003 and gitlab2002. The current replica is now gitlab2...
[21:09:03] Having some trouble doing a helmfile apply of `changeprop-jobqueue` to `codfw`; the deployment's replicaset failed to bring up the desired 30 total pods because of quotas, see this event output from `kubectl describe rs`:
[21:09:46] https://www.irccloud.com/pastebin/kbtLsT92/kubectl%20describe%20rs%20changeprop-production-7c8c5fc4f4%20
[21:11:22] I imagine I just need to try again at another time and get luckier on the pod:node distribution. I'm curious how much memory these pods actually use though; maybe there's some room to relax the requests or something
[21:13:39] Also, the `changeprop-production` deploy is set to 25% max unavailable and 25% max surge, so another option might be to fiddle with those parameters a bit, although that's probably not as good an option
[21:42:14] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) Thanks to @Dzahn the SSL issue is now resolved. I did have a question though, what happens to the old repl...
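For the changeprop-jobqueue quota problem above, a hedged sketch of how one could compare what the pods request against what the namespace quota still allows; the namespace name is an assumption based on the release name and may differ:

    # per-container requests/limits as rendered by the chart
    kubectl -n changeprop-jobqueue get deploy changeprop-production \
      -o jsonpath='{.spec.template.spec.containers[*].resources}{"\n"}'
    # how much of the namespace quota is already consumed
    kubectl -n changeprop-jobqueue describe resourcequota
    # where the pods actually landed (real memory usage is easier to read off the container Grafana dashboards)
    kubectl -n changeprop-jobqueue get pods -o wide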