[09:49:03] akosiaris: o/
[09:49:15] so the Istio metrics for wikikube codfw are not published
[09:49:29] not sure if they changed completely or not
[09:49:43] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[09:50:23] heh, I had forgotten about that dashboard
[09:50:26] lemme have a quick look
[09:52:16] I checked via nsenter on 2017, I only see envoy_ metrics
[09:52:28] using localhost:15000/stats/prometheus
[09:52:41] ouch
[09:54:54] I think I missed / didn't test this when I imported 1.15.x
[09:55:05] task is https://phabricator.wikimedia.org/T322193
[09:58:07] * akosiaris starting from scratch here, will take me a bit
[10:00:43] last data for codfw istio_requests_total is indeed around the time the re-imaging started
[10:01:22] looking into the k8s-pods job now
[10:03:16] hmm, I can see many pods not being marked as up by prometheus
[10:06:04] I am wondering if there is a new telemetry flag to turn on
[10:16:13] ok, the pods "down" are shellbox, mw-web and developer-portal stuff
[10:16:20] unrelated
[10:16:28] still something to fix but for later
[10:17:42] the tls-pods down are all tegola
[10:17:52] similar thing
[10:18:28] so I am being directed towards your theory now
[10:19:48] so, the old metrics were at port 15020
[10:20:12] but I see no such target?
[10:21:08] so in the pods I see the following annotations
[10:21:08] prometheus.io/path: /stats/prometheus
[10:21:09] prometheus.io/port: 15020
[10:21:09] prometheus.io/scrape: true
[10:21:27] those are good, but if I query them via nsenter I don't see the metrics
[10:21:37] I mean istio_etc.. metrics
[10:21:37] scratch that, my mistake, I am blind
[10:21:52] there are some but only envoy/xds related..
[10:22:04] lemme check on other nodes
[10:22:29] yeah, in eqiad I can see tons more metrics
[10:23:08] curl | wc -l says 934 lines in an ingress istio pod in codfw vs 4282 in eqiad
[10:23:19] so, something in istio
[10:25:20] istio_agent_pilot_xds{version="1.9.5"} exists in both clusters
[10:25:29] but codfw doesn't have much more than that
[10:26:27] there are all the istio_agent_* metrics, but no istio_requests_* metrics
[10:30:05] I thought that maybe some pods didn't get traffic and hence metrics didn't show up, but it seems that all of them have the same issue
[10:34:52] I am a bit at a loss figuring out from deployment-charts which istio version runs where
[10:46:21] akosiaris: we have everything under custom.d/, we use those manifests with istioctl to deploy the various versions
[10:46:31] not 100% perfect but it was the best compromise
[10:46:48] ah! That explains my failure
[10:46:48] with k8s 1.23 we run istio 1.15.4
[10:46:50] err .3
[10:46:55] the rest uses 1.9.5
[10:47:03] I was trying under helmfile.d
[10:48:24] in codfw btw I get this istio_agent_pilot_xds{version="1.9.5"}
[10:48:32] I am not sure if it should instead be 1.15.4
[10:50:11] for istiod?
[10:50:17] or istio-gateway?
[10:50:31] weird in theory it shouldn't
[10:50:42] istio-ingressgateway-qzvgd
[10:51:05] curl http://10.194.135.65:15020/stats/prometheus | grep istio_ | grep version from a prometheus host
[10:51:15] ahhh yes it is my fault
[10:51:44] awesome! Now that we have a scapegoat :D, what was the fault?
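A minimal sketch of the endpoint checks walked through above, assuming shell access to a host that can reach the pod IP (on the node itself this was done via nsenter); the pod IP is the example one quoted in the log and would need to be substituted. Port 15020 is the istio-agent's merged Prometheus endpoint that the prometheus.io/* annotations point at, while port 15000 is Envoy's own admin interface and only exposes envoy_* series:

    # Example ingressgateway pod IP taken from the log above; substitute your own
    POD_IP=10.194.135.65

    # Rough size of the exposed series set (934 lines in codfw vs 4282 in eqiad)
    curl -s "http://${POD_IP}:15020/stats/prometheus" | wc -l

    # The telemetry series that were missing in codfw
    curl -s "http://${POD_IP}:15020/stats/prometheus" | grep -c '^istio_requests_total'

    # Agent/xDS series that were present in both clusters, with the version label
    curl -s "http://${POD_IP}:15020/stats/prometheus" | grep 'istio_agent_pilot_xds'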
[10:53:23] in the proxyv2 Dockerfile there is an ISTIO_META_ISTIO_VERSION=1.9.5 that is clearly wrong :(
[10:53:29] I am going to send a patch
[10:53:34] but not sure if it is the issue
[10:56:29] https://gerrit.wikimedia.org/r/891506
[10:58:14] so, my google foo says there is a proxy.proxyVersion field in envoyfilter
[10:58:20] and it's currently
[10:58:26] Proxy Version: ^1\.15.*
[10:58:48] so, your theory that it might be related is starting to gain some evidence
[10:59:33] ah wow
[10:59:42] I am rebuilding the docker images
[10:59:48] meeting, bbl
[11:04:36] serviceops, Performance-Team, Patch-For-Review: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (Volans) [not a blocker, not for next week] Thinking a bit more about it I'd like to suggest an alternative approach for future usage, basically integrate it a bit more into the...
[11:14:21] akosiaris: I deployed the change to ml-staging manually, I see metrics now :)
[11:16:45] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/891508/
[11:16:57] then we have to roll it out via istioctl in the various clusters
[11:21:35] I feel like the sre.discovery.datacenter cookbook has been sufficiently tested during the switch maintenances that it can replace the sre.switchdc.services cookbook for the switchover. Thoughts?
[11:29:39] This would only involve calling it with --all so it switches A/P services too
[11:55:45] elukey: lol, I was 50/50 that it would indeed be that, happy to hear we were correct!
[11:55:53] claime: +1
[11:56:17] elukey: I'd like to roll it out in codfw, just to see it happen once
[11:56:42] I might ask for your help
[12:40:22] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Clement_Goubert) Open→Resolved a: Clement_Goubert The plan we landed on is: - Use the `sre.discovery.datacenter` to depool `eqiad` comple...
[12:40:28] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[12:49:23] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Clement_Goubert) The `sre.discovery.datacenter` cookbook allows to depool or repool a full datacenter (excluding A/A mediawiki services) at...
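A hypothetical sketch of the proxyVersion / ISTIO_META_ISTIO_VERSION mismatch discussed above: as suggested in the conversation, the stats filters are attached via EnvoyFilter resources whose match clause includes proxy.proxyVersion (here ^1\.15.*), so a proxy image that still advertises ISTIO_META_ISTIO_VERSION=1.9.5 never gets those filters and therefore never emits istio_requests_total. Assuming a kubeconfig already pointing at the affected cluster and the conventional istio-system namespace, the two sides of the mismatch can be compared with:

    # Version each running proxy reports (1.9.5 in codfw before the image fix)
    istioctl proxy-status

    # Version regex the stats EnvoyFilters expect to match (proxy.proxyVersion)
    kubectl -n istio-system get envoyfilters.networking.istio.io -o yaml | grep 'proxyVersion'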
[12:49:33] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[12:49:39] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Clement_Goubert) Open→Resolved a: Clement_Goubert
[12:50:24] serviceops, Data-Persistence (work done), SRE, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert) Open→Resolved
[12:50:30] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[12:50:38] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert)
[13:34:17] akosiaris: just seen the msg, happy to help if needed :)
[13:37:51] akosiaris: do you want to roll it out to the staging clusters as well?
[13:40:34] sure, why not
[14:12:05] elukey: codfw done. https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[14:12:08] that was easy
[14:12:25] docs lack a mention that you need to first kube_env though
[14:18:01] ah yes
[14:19:45] istioctl proxy-status command was pretty useful. It reports 1.9.5 so it was super easy to figure out which clusters were not upgraded
[14:21:50] yep istioctl is really nice
[14:39:01] serviceops, SRE, Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Mvolz)
[14:54:31] Someone on -tech is complaining about restbase/parsoid 400
[14:57:02] wait it may be someone running mediawiki themselves
[14:57:46] yeah, it is
[15:35:41] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) The infrastructure for splitting IS.php is in place now and it has become around 4...
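A short sketch of the kube_env + istioctl proxy-status check mentioned above. The kube_env invocation is an assumption (the log only notes that the docs omit it, and the exact arguments depend on the deployment-host setup); it points kubectl/istioctl at the chosen cluster before istioctl is run:

    # Select the target cluster first (assumed invocation; adjust to local docs)
    kube_env admin codfw

    # Each proxy's xDS sync status plus the Istio version it reports;
    # anything still showing 1.9.5 has not picked up the fixed proxyv2 image
    istioctl proxy-status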
[15:54:27] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) Open→Resolved
[15:54:43] serviceops, MW-on-K8s, Platform Engineering, Scap, and 6 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Ladsgroup)
[16:14:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Incident doc relating the minor editing incident due to {T330300} https://wikitech.wikimedia.org/wiki/Incidents/2023-02-22_read_only
[16:27:56] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (eoghan)
[20:05:45] serviceops, Performance-Team, Patch-For-Review: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (RLazarus) Sure, we could look at adding a warmup step to the server repool process. Historically we haven't worried about it, because the impact for one host is much smaller tha...
[23:17:39] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Quiddity)