[07:51:20] eoghan: jelto: good morning, I could use a review for a Docker image based on Bullseye for building python 2 app (that is for Zuul) https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/940161
[07:51:43] there is a series of patches following which I need to rework, but I could at least use that new image :)
[08:01:59] I'll take a look later today
[08:21:46] thanks :)
[09:59:10] serviceops, MW-on-K8s: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (fgiunchedi)
[10:12:23] serviceops, Parsoid: Spurious/unactionable mw latency exceeded for parsoid - https://phabricator.wikimedia.org/T348231 (fgiunchedi)
[10:12:42] serviceops, Parsoid: Spurious/unactionable MediaWikiLatencyExceeded alert exceeded for parsoid - https://phabricator.wikimedia.org/T348231 (fgiunchedi)
[10:39:11] godog: I'm checking that KubernetesAPILatency stuff and oh boy, look at those etcd latencies https://grafana.wikimedia.org/goto/2-G6hiGSk?orgId=1
[10:43:20] claime: oh wow, yeah that is not a happy camper
[10:43:33] I think they're undersized
[10:43:47] 1 vCPU is maybe a little short
[10:44:41] for etcd? could be yeah
[10:46:41] Also 150ms fsyncs
[10:47:05] Baseline of 50ms already seems high
[11:08:01] I think it may be disk related, that's usually the main bottleneck for etcd, but it couldn't hurt to give them a bit more cpu either
[11:13:44] +1
[11:13:47] going to lunch
[11:13:53] ack
[11:14:03] Do you know if there are any levers for disk perf on ganeti vms?
[11:14:18] not offhand no
[11:14:39] I'd imagine something drbd related tho, if anything
[11:14:44] yeah
[11:16:58] Oh wait, we don't need drbd for etcd
[11:17:09] If that's not already the case we could move to local storage
[11:20:56] They're already local storage :'(
[11:23:34] serviceops, MW-on-K8s: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (Clement_Goubert) Looks like there's some correlation with very bad etcd latency {F37986713} (Confusingly, that graph gives the latency by caller, not by destination) Looking at etcd metric...
[11:34:46] serviceops, MW-on-K8s: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (Clement_Goubert) The masters themselves only have 1 vCPU and seem a little undersized memory-wise (ex kubemaster1002) {F37986808} {F37986810} I think we've got the capacity to grow them a...
[11:36:22] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (Clement_Goubert) p:Triage→Medium a:Clement_Goubert
[11:38:45] serviceops, Parsoid: Spurious/unactionable MediaWikiLatencyExceeded alert exceeded for parsoid - https://phabricator.wikimedia.org/T348231 (Clement_Goubert) a:Clement_Goubert
[11:43:18] something is happening to parsoid though and this time it's not just it being slow
[11:44:08] Mean jumped from under 1s to over 5 https://grafana.wikimedia.org/goto/Z7nExiGSk?orgId=1
[11:48:09] Looks like it can't reach arclamp's redis
[11:48:51] That's not the cause, it's called because of the timeout
[11:51:03] Something's going on with parsoid and commons
[11:51:19] https://grafana.wikimedia.org/goto/4dOW-mGSk?orgId=1
[14:41:21] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2578e1de-95e1-47af-871c-5fc14c29acc0) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[14:45:01] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ee976bc-2144-44ed-a885-1325a5720050) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[14:47:22] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e5543a4c-cba1-4246-8253-ffe15dcde108) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[14:53:18] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1c916c69-a73b-4df2-ab3f-a877744bdad0) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[14:59:56] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7031ccc2-0805-4db8-aa97-55d719b4d006) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:07:36] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d85168b6-bad9-4157-b743-a5047d55be03) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:10:05] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2b7029e7-cb66-4d09-a3ae-01b8726b028e) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:13:06] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=52db74f7-7d98-4a04-8418-376cc7ab1496) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:20:16] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ae3368e2-fef2-4be8-b17e-8e3d2bb121cd) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:26:08] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=91da260a-99b3-46b1-b016-fe2edefaff1e) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reaso...
[15:33:07] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (Clement_Goubert) Bumped all kubetcd to 2 vcpu and all kubemasters to 2 vcpu and 4G ram. If that isn't sufficient, we may need to think about migrating to real hardware for these, beca...
[15:59:22] godog: I've bumped the resources, we'll see if it keeps alerting as often or not
[16:19:39] serviceops, Prod-Kubernetes: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 (Clement_Goubert) It's a bit better, but still borderline, as the alert is 5 minutes over .5s and we're basically just under the time threshold. {F37987608} etcd latency is a lot bet...
[21:10:58] serviceops, Wikimedia-Site-requests, Campaign-Registration, Campaign-Tools (Campaign-Tools-Current-Sprint): Configure the aggregation job to run periodically on Wikimedia wikis - https://phabricator.wikimedia.org/T339984 (Daimona) Stalled→Open Now testable in beta.
[22:37:33] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus)
[22:37:43] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus) p:Triage→Medium