[06:13:13] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) @wiki_willy @Jclark-ctr even if the task is stalled, just to make sure: these servers are still in rotation. Please do not decommission them until we've removed them. We need to resolve...
[06:14:42] 10serviceops: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10Joe) a:03Joe
[06:17:00] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) What's the status of this? Is this fixed?
[09:53:40] folks there seems to be an alert for high LIST latencies for k8s eqiad
[09:54:28] like
[09:54:29] "List" url:/api/v1/secrets (started: 2022-11-23 09:43:40.889726446 +0000 UTC m=+420248.747933016) (total time: 2.48976052s)
[09:57:10] elukey: looking
[09:57:46] don't really know what I can do, but I'm looking at the graphs
[09:59:00] claime: I just added two new graphs to the UI, for p75 and p95
[09:59:26] Oh thanks
[10:00:29] the alert seems to indicate p95 afaics from the repo
[10:01:04] ah ok something may have changed, I see this comment
[10:01:13] yeah...I very much regret working on that already :)
[10:01:53] # This alert excludes secrets. :D
[10:02:06] jayme: It looks like the duration varies a lot, but also there's almost no requests for the LIST
[10:02:08] there is one that is just for secrets
[10:02:50] ah okok
[10:03:28] https://grafana-rw.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?from=now-2d&to=now&var-datasource=thanos&var-site=eqiad&var-cluster=k8s&orgId=1&var-verb=LIST&var-group=All&var-resource=secrets&var-split_by_instance=
[10:03:45] started yesterday around 12.00
[10:04:33] this "usually" is the thing I created to monitor helm releases :/
[10:05:15] p95 seems more spiky, p99 is different, smoother
[10:05:38] the alert should be based on p95 right? (maybe we can add the panel to the alert's definition for clarity)
[10:05:40] elukey: The thing is, there's max 1r/s
[10:06:14] (on that particular endpoint)
[10:06:37] yeah, it's not much usually
[10:07:07] general cpu usage of the apiservers increased around the same time as well
[10:07:21] claime: ah yeah it may be the kube-apiserver in need of a restart, I've done a similar thing this morning for ml (not for secrets, it was the admission webhook). Since we are moving to 1.23 I'd prefer not to be dragged into debugging go code for hours if there is a quick fix :D
[10:07:32] elukey: for sure
[10:07:58] IIRC there was a mw deployment around that time yesterday?
[10:08:06] Yeah
[10:08:21] We did quite a few tests of mw deployments on wikikube too
[10:09:44] I don't really see how it would increase load on LIST secrets though
[10:10:43] helm-state-metrics is basically doing "helm list" when getting scraped by prometheus. That in turn lists all helm secrets in all namespaces
[10:11:25] as those secrets are rather big (they contain the whole chart IIRC) there is a long tail in delivering those back to helm-state-metrics
[10:11:32] Right. But the services have been deployed for a while
[10:11:37] could it be that the apis are simply backlogged?
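[editor's note] A minimal sketch of the cluster-wide LIST that the exchange above describes: it assumes the standard Helm 3 storage layout (one Secret per release revision, labelled owner=helm) and is not the actual helm-state-metrics code, just an approximation of the request it triggers on every Prometheus scrape.

    # Hypothetical sketch, not helm-state-metrics itself: issue the same kind of
    # cluster-wide LIST of Helm release secrets and report how many there are and
    # how large their payloads are, which is roughly what the apiserver is timing.
    from kubernetes import client, config

    def helm_release_secret_stats():
        config.load_kube_config()  # use config.load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        # Helm 3 stores one Secret per release revision, labelled owner=helm.
        secrets = v1.list_secret_for_all_namespaces(label_selector="owner=helm")
        total_bytes = 0
        for s in secrets.items:
            # The whole release (chart, values, rendered manifest) sits encoded
            # under the "release" key, which is why these objects are so large.
            payload = (s.data or {}).get("release", "")
            total_bytes += len(payload)
        print(f"{len(secrets.items)} helm release secrets, ~{total_bytes / 1024:.0f} KiB of payload")

    if __name__ == "__main__":
        helm_release_secret_stats()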
[10:11:44] for $bug or similar
[10:11:49] The only thing that changed is that we tested deploying through scap
[10:12:06] (and we're not currently doing that)
[10:12:25] also, as helm-state-metrics needs to untar all those secrets, it takes some time for it to consume them which I think also slows down the measurement of the request on the apiserver side
[10:12:38] ack
[10:14:23] I've seen something like that happening when helm-state-metrics suffered from a noisy neighbour
[10:15:31] let's see if killing the pod makes a difference (especially on CPU usage of the apiserver)
[10:15:41] jayme: but we do see 2s+ latencies on the kube api logs, I suppose that helm-state-metrics untars etc.. right after the call
[10:15:53] yes
[10:16:08] ah wait it is in a pod
[10:16:17] hm?
[10:16:24] helm-state-metrics is, yes
[10:16:37] didn't know about it
[10:16:57] anyway, killing it seems a good test, +1
[10:17:08] ok. killed
[10:28:02] Well it recovered the alert at least
[10:28:47] claime: K8s senses when Janis is upset
[10:28:56] x)
[10:29:11] Kill a pod for the example, alert recovers
[10:29:24] Scare your servers into working correctly
[10:31:41] The values are not really back to normal, so I would assume the alert will fire again shortly
[10:34:41] Yeah, request time nearly doubled since yesterday... but again I would have assumed it would have started being longer when we created the namespaces and deployed the services (since that's when the rise in number of charts to unpack should have happened, iiuc)
[10:35:38] yeah. but every helm release increases the number of secrets as well
[10:35:59] effie: have you taken any manual actions as regards the postgres replicas? I know the import is in progress, but the recovery from the lag makes me think the slots are working
[10:36:10] well...every helmfile apply to be precise
[10:36:20] hnowlan: none at all!
[10:36:27] nice <3
[10:37:01] :D:D
[10:37:15] jayme: Oh. That's going to become a problem then
[10:38:14] Aaaand it's back
[10:39:03] to some extent to be fair. we have a helm history limit of 10. so you can expect releases * 10 secrets to be around in the cluster
[10:40:25] IIRC the mw-* deployments create 2 releases each which should have upped the number of secrets quite a bit...but that should have been the case the last couple of days already. Not since yesterday
[10:40:27] That makes at least 2 * services * 10 secrets
[10:40:29] Yep
[10:40:51] Well tbf I'm not sure we did more than 2 releases on each before yesterday
[10:41:24] So we may have actually filled up the history for all these releases yesterday by doing the scap tests
[10:47:27] We created 27 history entries yesterday, so 270 secrets
[10:48:01] ah, okay
[10:48:18] so there where more than one deployment, right?
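[editor's note] The secret-count reasoning above as a back-of-the-envelope sketch. The history limit of 10 and the "2 releases per mw-* deployment" figure come from the conversation; the service count below is a placeholder, not a value measured on the cluster.

    # Each helmfile apply writes one new release secret per release; helm prunes
    # anything beyond the history limit, so that is the per-release ceiling.
    HISTORY_MAX = 10  # helm history limit quoted above

    def expected_secrets(releases: int, revisions_per_release: int) -> int:
        return releases * min(revisions_per_release, HISTORY_MAX)

    # e.g. mw-* deployments: 2 releases per service, history filled up by the
    # repeated scap test deploys -> 2 * services * 10 secrets at steady state.
    SERVICE_COUNT = 7  # placeholder, not the real number of mw-* services
    print(expected_secrets(releases=2 * SERVICE_COUNT, revisions_per_release=10))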
[10:48:28] *was
[10:48:38] According to the history we did 3 deployment tests yeah
[10:49:35] 21 new secrets but yeah
[10:50:14] still quite an increase for only 21
[10:50:44] Right, my bad
[11:01:53] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) p:05Triage→03Medium
[11:32:49] effie: keeping an eye on https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=maps1009&var-datasource=thanos&var-cluster=maps&from=1669192288492&to=1669203088492&viewPanel=6 - imposm is currently creating indices which will obviously use some temporary disk but the wal storage will also build up so we'll see even more usage
[11:33:21] nothing to panic about yet but good to get a baseline on
[11:34:47] claime: I'll revisit this when I feel better. Wanted to ack the alert but it is currently not firing. If it comes back, feel free to ack it until next week. It should not slow down overall cluster operations as it is limited to just the calls from helm-state-metrics
[11:35:12] jayme: alright, take care
[11:58:19] hnowlan: alright, cc nemo-yiannis ^
[13:32:47] hnowlan: It's kinda scary that disk usage was > 90% while indexing. Is there something we can do as a mitigation for future imports? Planet size only gets bigger over time
[14:02:45] 10serviceops: wikikube LIST secrets latency - https://phabricator.wikimedia.org/T323706 (10Clement_Goubert)
[14:19:59] 10serviceops, 10Infrastructure-Foundations, 10SRE-tools: httpbb random read timeout on cumin2002 - https://phabricator.wikimedia.org/T323707 (10Volans) p:05Triage→03Medium
[14:20:29] 10serviceops: httpbb random read timeout on cumin2002 - https://phabricator.wikimedia.org/T323707 (10Volans)
[14:32:11] 10serviceops, 10API Platform (Sprint 01), 10Platform Team Workboards (Platform Engineering Reliability): New Service Request uniqueDevices Endpoint: AQS 2.0 - https://phabricator.wikimedia.org/T320967 (10Atieno)
[14:45:18] nemo-yiannis: step #0 is get rid of tilerator and cassandra. I have that huge CR open for it but I need to revisit it, removing all tilerator references might be best done piecemeal rather than all at once :/
[15:01:29] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert)
[15:01:39] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) 05In progress→03Resolved
[15:35:20] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Ladsgroup) 05Open→03Resolved With splitting based on sharding and some other fixes (T322849), this has been properly mitigated. In order to avoid lingering tasks, I close this and...
[16:24:58] jayme: as you predicted, limits are a bit of a hiccup for thumbor. I think we can reasonably squeeze things *just* into the space in this CR though if you have a minute https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860072
[16:27:41] hnowlan: j.ayme is not well, do you need more than a cursory review and +1?
[16:28:04] Then again it's lowering values, so not like it's going to cause capacity issues for the cluster
[16:30:36] claime: nothing too involved, no! :) as it stands it's not running in prod because it hits the limits
[16:35:01] +1'd then :)
[16:40:03] thanks!
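[editor's note] For the maps1009 disk-usage watch further up, a hedged sketch of the kind of query behind the host-overview disk panel: node_filesystem_avail_bytes and node_filesystem_size_bytes are the standard node_exporter metrics, but the Prometheus endpoint URL below is a placeholder, not the real Thanos address.

    # Query a Prometheus-compatible HTTP API for filesystem usage on maps1009,
    # roughly what the Grafana host-overview panel renders.
    import requests

    PROM_URL = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint

    QUERY = (
        '100 * (1 - node_filesystem_avail_bytes{instance=~"maps1009.*", fstype!="tmpfs"}'
        ' / node_filesystem_size_bytes{instance=~"maps1009.*", fstype!="tmpfs"})'
    )

    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        mount = result["metric"].get("mountpoint", "?")
        used_pct = float(result["value"][1])
        print(f"{mount}: {used_pct:.1f}% used")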
[18:32:22] https://sysdig.com/blog/analysis-of-supply-chain-attacks-through-public-docker-images/
[18:32:39] Just started reading it.
[18:32:57] Still cryptomining I see
[19:03:47] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[19:07:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[19:29:24] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[20:41:50] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001...
[20:52:26] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[21:54:38] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye executed with errors: - arclamp1001...