[06:52:48] serviceops, Prod-Kubernetes, Patch-For-Review: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (Joe) Open→Resolved
[07:29:18] serviceops, SRE-OnFire, Traffic, conftool, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (Joe) I'm frankly not sure how checking appserver.svc.eqiad.wmnet:9090 from...
[09:14:25] serviceops, Kubernetes, Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (jijiki)
[09:46:54] Going to try 100% thumbor-k8s in both DCs
[09:49:35] gl;hf
[10:18:01] <_joe_> hnowlan: godspeed :)
[10:19:07] 🥂
[10:20:35] <_joe_> things seem to hold up ok for now :)
[10:20:38] <_joe_> that's amazing
[10:21:07] \o/
[10:21:50] <_joe_> well done hnowlan :)
[10:32:45] Let's wait until peak to celebrate :D
[11:01:17] excellent news :-)
[11:06:20] nice!
[15:24:07] there's a missing restbase host in pybal that isn't in scap https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/915699
[15:24:16] it's an old host but it's still pooled
[15:24:40] _joe_: I think we'll have to dig a bit into latency for mw-api-int
[15:26:15] <_joe_> claime: probably add some pods?
[15:26:25] <_joe_> claime: what's wrong there?
[15:26:57] _joe_: p75 is ~200ms
[15:27:08] <_joe_> and that's not unexpected
[15:27:32] It doubled with the migration of recommendation-api
[15:27:44] <_joe_> I guess the calls are more expensive
[15:27:46] It's not unexpected but I'm not happy about it :'D
[15:27:51] <_joe_> but I'm worried by the "envoy" part there
[15:28:20] <_joe_> hnowlan: need anything from us?
[15:28:42] hnowlan: +1'd
[15:29:01] <_joe_> the metric we called "envoy" is envoy_http_downstream_rq_time_bucket
[15:29:07] _joe_: I'm not finding data for the existing cluster
[15:29:29] <_joe_> Let me look at that later
[15:30:43] <_joe_> so downstream_rq_time is defined as "Total time for request and response (milliseconds)"
[15:30:54] So total rtt
[15:31:40] claime: ty!
[15:33:02] <_joe_> claime: but the green line, which is envoy_cluster_upstream_rq_time_bucket, is documented as "request time (milliseconds)"
[15:33:09] <_joe_> so I'm not sure what those two represent
[15:34:52] _joe_: I think the upstream_rq_time doesn't include the time to receive the response?
[15:35:09] <_joe_> then why is it higher?
[15:35:34] That's a very good question
[15:53:34] <_joe_> claime: sigh sob, on the api servers, envoy reports the second time slightly higher than the first
[15:53:43] <_joe_> which would make sense
[15:54:54] I saw similar while trying to debug the total time added by envoy processing for SLO purposes, fwiw. upstream_rq_time_bucket is (afair) from the moment the incoming request is received until the full upstream response is received by envoy
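A rough sketch of the comparison being discussed, assuming a Thanos/Prometheus query API is reachable over HTTP; the endpoint URL, conn-manager prefix, and cluster name below are illustrative placeholders, not the production values:

    import requests

    # Assumed Thanos/Prometheus query endpoint; the real URL differs in production.
    PROM = "https://thanos-query.example.org/api/v1/query"

    # p75 of the two envoy histograms discussed above. Label values are illustrative:
    #  - downstream_rq_time: total time for request and response, as seen at the listener
    #  - upstream_rq_time: per the discussion above, roughly from the incoming request
    #    until the full upstream response has been received by envoy
    QUERIES = {
        "downstream_p75": (
            "histogram_quantile(0.75, sum by (le) ("
            "rate(envoy_http_downstream_rq_time_bucket"
            '{envoy_http_conn_manager_prefix="ingress_http"}[5m])))'
        ),
        "upstream_p75": (
            "histogram_quantile(0.75, sum by (le) ("
            "rate(envoy_cluster_upstream_rq_time_bucket"
            '{envoy_cluster_name="mw-api-int-async"}[5m])))'
        ),
    }

    for name, query in QUERIES.items():
        resp = requests.get(PROM, params={"query": query}, timeout=10).json()
        for sample in resp["data"]["result"]:
            print(f"{name}: {float(sample['value'][1]):.1f} ms")

Plotting or printing the two quantiles side by side is one way to see how much of the reported latency is attributable to envoy itself versus the upstream, which is the distinction being debated above.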
[15:55:12] be sure you're filtering by the right envoy_http_conn_manager_prefix - I don't think api-gateway even *had* the downstream metrics for the non-admin interface
[15:55:31] <_joe_> hnowlan: yeah I am doing that correctly I think, but I'm re-checking
[15:56:52] <_joe_> yeah I am doing it correctly in Explore, but it was clearly wrong in the dashboard
[16:02:05] <_joe_> claime: if you're curious, the backend times remained notably stable https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=recommendation-api&var-destination=mw-api-int-async&var-destination=mwapi-async&viewPanel=14&from=now-7d&to=now
[16:02:09] <_joe_> in the transition
[16:02:16] <_joe_> one exception is p99 being more noisy
[16:11:36] put together a candidate for a briefer k8s-only dashboard for thumbor
[16:11:49] RIP the "Thumbor" toggle button that did nothing https://grafana-rw.wikimedia.org/d/FMKakEyVz/hnowlan-thumbor-k8s
[16:42:19] sadly the capacity increase either wasn't enough, or there's something more annoying afoot as regards scaling traffic on k8s
[16:42:42] I am going to look at the timeouts tomorrow, I think we're backing requests up into queues too much and our client timeouts might be too short
[16:47:17] serviceops, Shellbox, SyntaxHighlight, User-brennen, Wikimedia-production-error: Pages with Pygments or Timeline intermittently fail to render (Shellbox server returned status code 503) - https://phabricator.wikimedia.org/T292663 (Krinkle)
[17:02:49] You should have kept the toggle
[17:02:52] Thumbor: yes
[17:05:33] :D
[17:05:44] forgot to enable the thumbor, rookie mistake
[17:24:59] problem is definitely at the haproxy level, the workers on the backlogged pods aren't even busy when the errors come in
[17:25:37] either a timeouts issue or something to do with haproxy's conception of healthy/unhealthy workers that I don't quite understand yet
[20:33:37] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (RLazarus) 892570 is merged now, and I think we'll be in better shape for the next one. @Clement_Goubert I'm tempted to resolve this, and reopen if we...
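On the haproxy question at 17:24:59–17:25:37: a minimal sketch of one way to dump haproxy's own view of each worker's health and queue depth, assuming the admin stats socket is enabled inside the pod; the socket path below is an assumption, not the chart's actual value:

    import csv
    import io
    import socket

    # Assumed admin socket path inside a thumbor pod; the real path is chart-specific.
    HAPROXY_SOCKET = "/run/haproxy/admin.sock"

    def show_stat(path):
        """Return haproxy's 'show stat' CSV as a list of dicts, one per proxy/server row."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(path)
            sock.sendall(b"show stat\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        text = b"".join(chunks).decode().lstrip("# ")
        return list(csv.DictReader(io.StringIO(text)))

    # Print health status, queued requests, and active sessions per backend server,
    # i.e. haproxy's conception of which workers are healthy/busy right now.
    for row in show_stat(HAPROXY_SOCKET):
        if row.get("svname") not in ("FRONTEND", "BACKEND"):
            print(row["pxname"], row["svname"], row["status"],
                  "qcur=" + row["qcur"], "scur=" + row["scur"])

If the server rows show qcur climbing while scur stays low and status stays UP, that would point at queueing/timeout tuning rather than health-check flapping as the source of the errors.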