[00:18:49] serviceops, Gerrit, SRE, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn)
[03:17:51] serviceops, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8122731, @Joe wrote: > Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to ma...
[03:47:16] serviceops, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8123041, @MatthewVernon wrote: > Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably writ...
[06:16:12] inflatador: ryankemper: you need to bump the chart version in Chart.yaml for that change to be picked up
[06:40:53] jayme: ah, that makes sense - staging must just not have had the latest version applied yet, which is why we saw a diff for that
[06:42:16] thanks, we'll bump the version and deploy it properly tomorrow
[07:24:47] <_joe_> ahoy serviceops people, we have some fresh racks to power down today
[07:25:40] <_joe_> I'll do rack D2
[07:25:44] <_joe_> :D
[07:27:41] <_joe_> jayme: see, I still have some manager in me ^^
[07:28:17] I never had any doubts
[07:30:22] I'm currently trying to figure out why we have such elevated list latency in k8s codfw since yesterday
[07:38:19] <_joe_> jayme: maybe moritz moved etcd to drbd for maintenance?
[07:38:32] currently only a staging etcd node
[07:38:41] the rest is all on plain disk storage
[07:38:50] <_joe_> ok then it must be something else
[07:38:59] <_joe_> jayme: did you try to restart the apiserver?
[07:43:19] no. I was thinking in the direction of another reflector death loop
[07:43:34] like https://phabricator.wikimedia.org/T303184
[07:43:59] but haven't found solid evidence for that so far.
[07:44:11] need to run a quick errand...back in ~15min
[08:00:34] <_joe_> hnowlan: some servers that should go into maintenance today are "yours", I am specifically worried about maps2008
[08:00:49] <_joe_> I guess it should be in the "turn off just before" category
[08:04:23] <_joe_> jelto, jayme: jokes aside, I have a quite busy day with at least one meeting I have to do homework for - I was supposed to do that yesterday but you know...
[08:05:05] <_joe_> so given the number of servers is smaller, I'd ask you to pick up the prep work this morning of turning off all the servers that don't need to stay on
[08:06:25] _joe_: yes, sure. I'm going to get to that in 30-45min
[08:06:46] thanks for completing D2 already :-p
[08:06:49] <_joe_> you have until the afternoon and it's like 10 servers, take it easy :)
[08:06:50] I can help turning off the servers which don't have the "just before" flag.
[08:07:09] <_joe_> jelto: yeah I was asking for that :)
[08:07:25] <_joe_> thanks folks, you're amazing <3
[08:08:08] jelto: can you schedule a downtime for all the not-"just before" nodes until 19.00Z or so?
[08:09:04] yes I'll do that
[08:10:19] elukey: we have to take kafka-main2003 down today (rack C7 maintenance ~16.00Z). Current plan is to shut it down right before the maintenance starts and bring it back right after. Anything else to consider?
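A minimal sketch of the kafka-main2003 plan discussed here (elukey confirms below that a clean broker stop plus shutdown is enough); the downtiming step and the verification command are assumptions about how this would typically be handled, not the exact procedure that was used:

    # schedule an Icinga downtime for kafka-main2003 covering the C7 window (tooling not specified in the log)
    # then, on kafka-main2003 itself, right before the maintenance starts:
    sudo systemctl stop kafka    # clean broker stop; expect some under-replicated partition alerts
    sudo poweroff                # power down for the PDU swap
    # after power-on, check that the broker has rejoined and partitions have caught up
    # (the exact CLI wrapper/path and flags depend on the Kafka version installed on the host)
    kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions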
[08:10:39] https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2 ~ line 101
[08:10:41] I've now also switched kubestagetcd2002 back to "plain" disk storage
[08:15:31] I'm starting with downtiming and powering off the first machines for today which are not "just before". Starting with mc2047-48 (not in production)
[08:19:24] jayme: o/ a clean stop of kafka + shutdown is enough, there will be some alarms about under-replicated partitions but not a big deal. Two nodes at the same time would be tolerable but more risky (if anything else fails)
[08:32:44] <_joe_> elukey: can you add instructions on how to cleanly shut down to https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2 ?
[08:33:10] <_joe_> I'm not sure the people who will be doing that are here right now, given the time of the day when this should happen
[08:36:28] ah yes, it is just systemctl stop kafka, nothing big
[08:36:48] it could be part of a clean shutdown, so maybe not needed
[08:48:48] interesting pattern regarding LIST latency...
[08:49:05] api request avg5m looks quite elevated (we alert on that one) https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&from=1659516500000&to=now&var-datasource=thanos&var-site=codfw&var-cluster=k8s
[08:49:45] but p99 latency of LIST calls has not increased: https://grafana-rw.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?from=1659516500000&to=now&var-datasource=thanos&var-site=codfw&var-cluster=k8s&orgId=1&var-verb=LIST&var-group=All&var-resource=secrets&forceLogin
[08:50:35] the situation immediately got better when jel.to drained kubernetes2022, as that re-scheduled helm-state-metrics: https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=1659516500000&to=now&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=helm-state-metrics-c5594cd87-bfz55&var-pod=helm-state-metrics-c5594cd87-hjx46&var-pod=helm-state-metrics-c5594cd87-hrskg
[08:51:37] I've seen increasing LIST times when helm-state-metrics was throttled (it is kind of spiky because it needs to unpack all the helm release secrets, unfortunately)
[08:52:17] so probably a noisy-neighbour scenario (as there is no throttling involved now - helm-state-metrics has no cpu limit)
[09:23:52] When shutting down mw* hosts in codfw today, they don't need to be depooled individually because the service/discovery is depooled for codfw atm, is this correct :)? (for example confctl --object-type discovery select 'dnsdisc=appserver.*' get)
[09:25:06] <_joe_> no
[09:25:17] <_joe_> :P
[09:25:37] <_joe_> they need to be set to inactive so that they don't appear in the scap sync list
[09:26:28] <_joe_> actually, I think we should add two systemd timers: one on startup that does a scap pull once the network is up
[09:26:38] <_joe_> and one that on shutdown runs "sudo decommission"
[09:26:51] _joe_: ah yes I remember. And this was done by the "decommission" alias? I was looking in SAL and could not find anything from yesterday
[09:26:57] <_joe_> yes
[09:30:17] ok thanks. I'll proceed with mw2271-79
[09:42:51] _joe_: maps should be fine, some restbase though
[09:43:01] <_joe_> yeah...
[09:43:17] they're in the same rack, which should be fine; it's more hassle to downtime/shut them down in advance than it's worth
[09:43:26] but I have an eye on it
[09:47:44] fyi: I need to go afk in around 1h for 1h. I'll be back before the pdu swap starts (so back at ~12:00Z)
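Picking up _joe_'s point above about marking the mw hosts inactive (rather than relying on the codfw discovery depool), a hedged sketch of what that could look like with confctl for a single host; the selector and the exact sequence are illustrative, not the commands that were actually run:

    # before power-off: set the host inactive so it drops out of the scap sync list
    sudo confctl select 'name=mw2271.codfw.wmnet' set/pooled=inactive
    # after power-on: refresh the MediaWiki tree on the host, then pool it again
    scap pull
    sudo confctl select 'name=mw2271.codfw.wmnet' set/pooled=yes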
[09:49:48] <_joe_> jelto: thanks for all the help
[10:03:44] Hello, I wonder if someone could clarify something for me please - about bootstrapping a new etcd cluster.
[10:04:49] What's the current state of things with regard to the tlsproxy part of it? Wikitech says:
[10:04:55] https://www.irccloud.com/pastebin/Ko3nccim/
[10:05:46] <_joe_> btullis: is that a cluster for kubernetes?
[10:05:52] ...but as far as I can see that parameter no longer exists.
[10:06:08] Yes, it's for a new kubernetes cluster
[10:06:48] Re: https://phabricator.wikimedia.org/T310196
[10:07:59] <_joe_> btullis: yeah, now we have profile::etcd::tlsproxy if you want to add the proxy
[10:08:07] <_joe_> but for k8s IIRC we do not do that
[10:08:32] <_joe_> it would be quite useless as access control is based on IP address rather than on RBAC
[10:08:58] <_joe_> the thing that is very slow in etcd is the authn/authz function *for the v2 api*
[10:09:52] <_joe_> btullis: I think you should basically use role::etcd::v3::kubernetes
[10:10:43] <_joe_> but you'll need to create a separate role I think, like elukey has done for the ML cluster, so that it's easier to manage hiera
[10:10:59] <_joe_> (his is role::etcd::v3::ml_etcd )
[10:11:34] <_joe_> btullis: which wikitech page were you reading?
[10:11:39] <_joe_> I should clarify it
[10:12:06] OK cool, thanks. Yes, that's what I've done: role::etcd::v3::dse_k8s_etcd (based on ml_etcd, as you mentioned)
[10:12:25] Sample code and instructions here: https://wikitech.wikimedia.org/wiki/Etcd#Bootstrapping_an_etcd_cluster
[10:13:09] I can try to update it, but I thought I'd like to clarify so I don't botch the instructions for someone else :-)
[10:13:23] <_joe_> ah sigh, yes, that's for a *main* etcd cluster - apologies, it's clearly not been updated since we got k8s etcds
[10:14:15] <_joe_> ok let me first add a note there at the top
[10:14:19] All good, thanks for the explanation.
[10:17:03] _joe_: Would you mind if I add you as a reviewer for the CRs to bootstrap it? They shouldn't affect anything outside of this cluster, but I haven't bootstrapped an etcd cluster before.
[10:17:33] the hosts that could be prepared for today are all shut down
[10:17:36] <_joe_> btullis: sure, I'm quite rusty on bootstrapping a cluster but I should be able to help
[10:17:41] <_joe_> jayme: <3
[10:17:54] <_joe_> we'll follow the maintenance in the afternoon
[10:26:41] jayme: thanks!
[14:02:23] serviceops, Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (JMcLeod_WMF)
[14:03:25] serviceops, Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (JMcLeod_WMF) p:Triage→Medium
[14:56:05] I'm going to power off mc2030 and mc2031
[14:59:56] thanks!
[15:02:11] sorry, I'm going to be unavailable to help with the maintenance today until about 20 UTC -- mutante, will you be able to handle the serviceops steps?
[15:09:35] <_joe_> jelto: ah you're here
[15:09:36] <_joe_> :D
[15:09:45] <_joe_> we should really all coordinate in #sre
[15:09:50] joe: I already powered down mc2030-31.
[15:09:55] okay
[15:12:52] I suggested -sre yesterday already - it would make it easier not having to check 3 channels
[15:13:05] <_joe_> jayme: let's all just communicate there
[15:13:08] <_joe_> it will be contagious
[15:13:13] <_joe_> hopefully
[15:14:20] <_joe_> hnowlan: we're now powering off C5
[15:14:40] <_joe_> should we turn off restbase2020 and 2025?
[15:15:10] yep, draining them now
[15:15:32] mentioned in #wikimedia-dcops
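Circling back to the etcd bootstrapping thread earlier in the log: once the new dse_k8s_etcd nodes are up, a quick sanity check with etcdctl (v3 API) could look like the sketch below; the endpoint hostname and CA path are illustrative placeholders, not the real values for that cluster:

    export ETCDCTL_API=3
    # hostname and --cacert path below are assumptions for illustration only
    etcdctl --endpoints=https://dse-k8s-etcd1001.eqiad.wmnet:2379 \
      --cacert=/path/to/cluster-ca.pem endpoint health
    etcdctl --endpoints=https://dse-k8s-etcd1001.eqiad.wmnet:2379 \
      --cacert=/path/to/cluster-ca.pem member list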
[16:07:05] jayme, _joe_: I'd like to leave in around 10 minutes. Does that work for you?
[16:07:19] <_joe_> jelto: go on
[16:07:26] yes
[16:07:38] I can assist from here
[16:07:42] thanks <3
[17:28:56] one more thing: "PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled" - looking at that next ;/
[17:29:16] it's a race around listening on the v6 IP or so..
[17:34:35] fixed with 'systemctl restart ssh-phab' on phab2001. RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK
[18:41:53] all mw servers in D3 scap pulled / pooled again / downtimes removed
[20:26:57] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) ` # host gitlab-replica-old.wikimedia.org gitlab-replica-old.wikimedia.org has address 208.80.154.15 gitlab-...
[20:29:29] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) @Jelto I managed to switch over the hosts, i.e. gitlab1003 and gitlab2002. The current replica is now gitlab2...
[21:09:03] Having some trouble doing a helmfile apply of `changeprop-jobqueue` to `codfw`; the deployment's replicaset failed to bring up the desired 30 total pods because of quotas, see this event output from `kubectl describe rs`:
[21:09:46] https://www.irccloud.com/pastebin/kbtLsT92/kubectl%20describe%20rs%20changeprop-production-7c8c5fc4f4%20
[21:11:22] I imagine I just need to try again at another time and get luckier on the pod:node distribution. I'm curious how much memory these pods actually use though; maybe there's some room to relax the requests or something
[21:13:39] Also, the `changeprop-production` deploy is set to 25% max unavailable and 25% max surge, so another option might be to fiddle with those parameters a bit, although that's probably not as good an option
[21:42:14] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (Arnoldokoth) Thanks to @Dzahn the SSL issue is now resolved. I did have a question though, what happens to the old repl...
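For the changeprop-jobqueue quota problem above, a hedged sketch of how one could compare what the pods request against what the namespace quota still allows; the namespace name is an assumption based on the release name and may differ:

    # per-container requests/limits as rendered by the chart
    kubectl -n changeprop-jobqueue get deploy changeprop-production \
      -o jsonpath='{.spec.template.spec.containers[*].resources}{"\n"}'
    # how much of the namespace quota is already consumed
    kubectl -n changeprop-jobqueue describe resourcequota
    # where the pods actually landed (real memory usage is easier to read off the container Grafana dashboards)
    kubectl -n changeprop-jobqueue get pods -o wide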