[06:52:48] serviceops, Prod-Kubernetes, Patch-For-Review: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (Joe) Open→Resolved
[07:29:18] serviceops, SRE-OnFire, Traffic, conftool, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (Joe) I'm frankly not sure how checking appserver.svc.eqiad.wmnet:9090 from...
[09:14:25] serviceops, Kubernetes, Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (jijiki)
[09:46:54] Going to try 100% thumbor-k8s in both DCs
[09:49:35] gl;hf
[10:18:01] <_joe_> hnowlan: godspeed :)
[10:19:07] 🥂
[10:20:35] <_joe_> things seem to hold up ok for now :)
[10:20:38] <_joe_> that's amazing
[10:21:07] \o/
[10:21:50] <_joe_> well done hnowlan :)
[10:32:45] Let's wait until peak to celebrate :D
[11:01:17] excellent news :-)
[11:06:20] nice!
[15:24:07] there's a missing restbase host in pybal that isn't in scap https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/915699
[15:24:16] it's an old host but it's still pooled
[15:24:40] _joe_: I think we'll have to dig a bit into latency for mw-api-int
[15:26:15] <_joe_> claime: probably add some pods?
[15:26:25] <_joe_> claime: what's wrong there?
[15:26:57] _joe_: p75 is ~200ms
[15:27:08] <_joe_> and that's not unexpected
[15:27:32] It doubled with the migration of recommendation-api
[15:27:44] <_joe_> I guess the calls are more expensive
[15:27:46] It's not unexpected but I'm not happy about it :'D
[15:27:51] <_joe_> but I'm worried by the "envoy" part there
[15:28:20] <_joe_> hnowlan: need anything from us?
[15:28:42] hnowlan: +1'd
[15:29:01] <_joe_> the metric we called "envoy" is envoy_http_downstream_rq_time_bucket
[15:29:07] _joe_: I'm not finding data for the existing cluster
[15:29:29] <_joe_> Let me look at that later
[15:30:43] <_joe_> so downstream_rq_time is defined as "Total time for request and response (milliseconds)"
[15:30:54] So total rtt
[15:31:40] claime: ty!
[15:33:02] <_joe_> claime: but the green line, which is envoy_cluster_upstream_rq_time_bucket, is documented as "request time (milliseconds)"
[15:33:09] <_joe_> so I'm not sure what those two represent
[15:34:52] _joe_: I think the upstream_rq_time doesn't include the time to receive the response?
[15:35:09] <_joe_> then why is it higher?
[15:35:34] That's a very good question
[15:53:34] <_joe_> claime: sigh sob, on the api servers, envoy reports the second time slightly higher than the first
[15:53:43] <_joe_> which would make sense
[15:54:54] I saw similar while trying to debug the total time added by envoy processing for SLO purposes, fwiw. upstream_rq_time_bucket is (afair) from the moment the incoming request is received until the full upstream response is received by envoy
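A rough sketch of the comparison being discussed, assuming a Thanos/Prometheus query API is reachable over HTTP; the endpoint URL, conn-manager prefix, and cluster name below are illustrative placeholders, not the production values:

    import requests

    # Assumed Thanos/Prometheus query endpoint; the real URL differs in production.
    PROM = "https://thanos-query.example.org/api/v1/query"

    # p75 of the two envoy histograms discussed above. Label values are illustrative:
    #  - downstream_rq_time: total time for request and response, as seen at the listener
    #  - upstream_rq_time: per the discussion above, roughly from the incoming request
    #    until the full upstream response has been received by envoy
    QUERIES = {
        "downstream_p75": (
            "histogram_quantile(0.75, sum by (le) ("
            "rate(envoy_http_downstream_rq_time_bucket"
            '{envoy_http_conn_manager_prefix="ingress_http"}[5m])))'
        ),
        "upstream_p75": (
            "histogram_quantile(0.75, sum by (le) ("
            "rate(envoy_cluster_upstream_rq_time_bucket"
            '{envoy_cluster_name="mw-api-int-async"}[5m])))'
        ),
    }

    for name, query in QUERIES.items():
        resp = requests.get(PROM, params={"query": query}, timeout=10).json()
        for sample in resp["data"]["result"]:
            print(f"{name}: {float(sample['value'][1]):.1f} ms")

Plotting or printing the two quantiles side by side is one way to see how much of the reported latency is attributable to envoy itself versus the upstream, which is the distinction being debated above.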
[15:55:12] be sure you're filtering by the right envoy_http_conn_manager_prefix - I don't think api-gateway even *had* the downstream metrics for the non-admin interface
[15:55:31] <_joe_> hnowlan: yeah I am doing that correctly I think, but I'm re-checking
[15:56:52] <_joe_> yeah I am doing it correctly in Explore, but it was clearly wrong in the dashboard
[16:02:05] <_joe_> claime: if you're curious, the backend times remained notably stable https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=recommendation-api&var-destination=mw-api-int-async&var-destination=mwapi-async&viewPanel=14&from=now-7d&to=now
[16:02:09] <_joe_> in the transition
[16:02:16] <_joe_> one exception is p99 being more noisy
[16:11:36] put together a candidate for a briefer k8s-only dashboard for thumbor
[16:11:49] RIP the "Thumbor" toggle button that did nothing https://grafana-rw.wikimedia.org/d/FMKakEyVz/hnowlan-thumbor-k8s
[16:42:19] sadly the capacity increase either wasn't enough, or there's something more annoying afoot as regards scaling traffic on k8s
[16:42:42] I am going to look at the timeouts tomorrow, I think we're backing requests up into queues too much and our client timeouts might be too short
[16:47:17] serviceops, Shellbox, SyntaxHighlight, User-brennen, Wikimedia-production-error: Pages with Pygments or Timeline intermittently fail to render (Shellbox server returned status code 503) - https://phabricator.wikimedia.org/T292663 (Krinkle)
[17:02:49] You should have kept the toggle
[17:02:52] Thumbor: yes
[17:05:33] :D
[17:05:44] forgot to enable the thumbor, rookie mistake
[17:24:59] problem is definitely at the haproxy level, the workers on the backlogged pods aren't even busy when the errors come in
[17:25:37] either a timeouts issue or something to do with haproxy's conception of healthy/unhealthy workers that I don't quite understand yet
[20:33:37] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (RLazarus) 892570 is merged now, and I think we'll be in better shape for the next one. @Clement_Goubert I'm tempted to resolve this, and reopen if we...
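On the haproxy question at 17:24:59–17:25:37: a minimal sketch of one way to dump haproxy's own view of each worker's health and queue depth, assuming the admin stats socket is enabled inside the pod; the socket path below is an assumption, not the chart's actual value:

    import csv
    import io
    import socket

    # Assumed admin socket path inside a thumbor pod; the real path is chart-specific.
    HAPROXY_SOCKET = "/run/haproxy/admin.sock"

    def show_stat(path):
        """Return haproxy's 'show stat' CSV as a list of dicts, one per proxy/server row."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(path)
            sock.sendall(b"show stat\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        text = b"".join(chunks).decode().lstrip("# ")
        return list(csv.DictReader(io.StringIO(text)))

    # Print health status, queued requests, and active sessions per backend server,
    # i.e. haproxy's conception of which workers are healthy/busy right now.
    for row in show_stat(HAPROXY_SOCKET):
        if row.get("svname") not in ("FRONTEND", "BACKEND"):
            print(row["pxname"], row["svname"], row["status"],
                  "qcur=" + row["qcur"], "scur=" + row["scur"])

If the server rows show qcur climbing while scur stays low and status stays UP, that would point at queueing/timeout tuning rather than health-check flapping as the source of the errors.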