[09:49:03] akosiaris: o/
[09:49:15] so the Istio metrics for wikikube codfw are not published
[09:49:29] not sure if they changed completely or not
[09:49:43] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[09:50:23] heh, I had forgotten about that dashboard
[09:50:26] lemme have a quick look
[09:52:16] I checked via nsenter on 2017, I only see envoy_ metrics
[09:52:28] using localhost:15000/stats/prometheus
[09:52:41] ouch
[09:54:54] I think I missed / didn't test this when I imported 1.15.x
[09:55:05] task is https://phabricator.wikimedia.org/T322193
[09:58:07] * akosiaris starting from scratch here, will take me a bit
[10:00:43] last data for codfw istio_requests_total is indeed around the time the re-imaging started
[10:01:22] looking into the k8s-pods job now
[10:03:16] hmm, I can see many pods not being marked as up by prometheus
[10:06:04] I am wondering if there is a new telemetry flag to turn on
[10:16:13] ok, the pods "down" are shellbox, mw-web and developer-portal stuff
[10:16:20] unrelated
[10:16:28] still something to fix but for later
[10:17:42] the tls-pods down are all tegola
[10:17:52] similar thing
[10:18:28] so I am being directed towards your theory now
[10:19:48] so, the old metrics were at port 15020
[10:20:12] but I see no such target?
[10:21:08] so in the pods I see the following annotations
[10:21:08] prometheus.io/path: /stats/prometheus
[10:21:09] prometheus.io/port: 15020
[10:21:09] prometheus.io/scrape: true
[10:21:27] those are good, but if I query them via nsenter I don't see the metrics
[10:21:37] I mean istio_etc.. metrics
[10:21:37] scratch that, my mistake, I am blind
[10:21:52] there are some but only envoy/xds related..
[10:22:04] lemme check on other nodes
[10:22:29] yeah, in eqiad I can see tons more metrics
[10:23:08] curl | wc -l says 934 lines in an ingress istio pod in codfw vs 4282 in eqiad
[10:23:19] so, something in istio
[10:25:20] istio_agent_pilot_xds{version="1.9.5"} exists in both clusters
[10:25:29] but codfw doesn't have much more than that
[10:26:27] there are all the istio_agent_* metrics, but no istio_requests_* metrics
[10:30:05] I thought that maybe some pods didn't get traffic and hence metrics didn't show up, but it seems that all of them have the same issue
[10:34:52] I am a bit at a loss figuring out from deployment-charts which istio version runs where
[10:46:21] akosiaris: we have everything under custom.d/, we use those manifests with istioctl to deploy the various versions
[10:46:31] not 100% perfect but it was the best compromise
[10:46:48] ah! That explains my failure
[10:46:48] with k8s 1.23 we run istio 1.15.4
[10:46:50] err .3
[10:46:55] the rest uses 1.9.5
[10:47:03] I was trying under helmfile.d
[10:48:24] in codfw btw I get this istio_agent_pilot_xds{version="1.9.5"}
[10:48:32] I am not sure if it should instead be 1.15.4
[10:50:11] for istiod?
[10:50:17] or istio-gateway?
[10:50:31] weird in theory it shouldn't
[10:50:42] istio-ingressgateway-qzvgd
[10:51:05] curl http://10.194.135.65:15020/stats/prometheus | grep istio_ | grep version from a prometheus host
[10:51:15] ahhh yes it is my fault
[10:51:44] awesome! Now that we have a scapegoat :D, what was the fault?
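A minimal sketch of the endpoint checks walked through above, assuming shell access to a host that can reach the pod IP (on the node itself this was done via nsenter); the pod IP is the example one quoted in the log and would need to be substituted. Port 15020 is the istio-agent's merged Prometheus endpoint that the prometheus.io/* annotations point at, while port 15000 is Envoy's own admin interface and only exposes envoy_* series:

    # Example ingressgateway pod IP taken from the log above; substitute your own
    POD_IP=10.194.135.65

    # Rough size of the exposed series set (934 lines in codfw vs 4282 in eqiad)
    curl -s "http://${POD_IP}:15020/stats/prometheus" | wc -l

    # The telemetry series that were missing in codfw
    curl -s "http://${POD_IP}:15020/stats/prometheus" | grep -c '^istio_requests_total'

    # Agent/xDS series that were present in both clusters, with the version label
    curl -s "http://${POD_IP}:15020/stats/prometheus" | grep 'istio_agent_pilot_xds'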
[10:53:23] in the proxyv2 Dockerfile there is an ISTIO_META_ISTIO_VERSION=1.9.5 that is clearly wrong :(
[10:53:29] I am going to send a patch
[10:53:34] but not sure if it is the issue
[10:56:29] https://gerrit.wikimedia.org/r/891506
[10:58:14] so, my google foo says there is a proxy.proxyVersion field in envoyfilter
[10:58:20] and it's currently
[10:58:26] Proxy Version: ^1\.15.*
[10:58:48] so, your theory that it might be related is starting to gain some evidence
[10:59:33] ah wow
[10:59:42] I am rebuilding the docker images
[10:59:48] meeting, bbl
[11:04:36] serviceops, Performance-Team, Patch-For-Review: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (Volans) [not a blocker, not for next week] Thinking a bit more about it I'd like to suggest an alternative approach for future usage, basically integrate it a bit more into the...
[11:14:21] akosiaris: I deployed the change to ml-staging manually, I see metrics now :)
[11:16:45] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/891508/
[11:16:57] then we have to roll it out via istioctl in the various clusters
[11:21:35] I feel like the sre.discovery.datacenter cookbook has been sufficiently tested during the switch maintenances that it can replace the sre.switchdc.services cookbook for the switchover. Thoughts?
[11:29:39] This would only involve calling it with --all so it switches A/P services too
[11:55:45] elukey: lol, I was 50/50 that it would indeed be that, happy to hear we were correct!
[11:55:53] claime: +1
[11:56:17] elukey: I'd like to roll it out in codfw, just to see it happen once
[11:56:42] I might ask for your help
[12:40:22] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Clement_Goubert) Open→Resolved a: Clement_Goubert The plan we landed on is: - Use the `sre.discovery.datacenter` to depool `eqiad` comple...
[12:40:28] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[12:49:23] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Clement_Goubert) The `sre.discovery.datacenter` cookbook allows to depool or repool a full datacenter (excluding A/A mediawiki services) at...
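A hypothetical sketch of the proxyVersion / ISTIO_META_ISTIO_VERSION mismatch discussed above: as suggested in the conversation, the stats filters are attached via EnvoyFilter resources whose match clause includes proxy.proxyVersion (here ^1\.15.*), so a proxy image that still advertises ISTIO_META_ISTIO_VERSION=1.9.5 never gets those filters and therefore never emits istio_requests_total. Assuming a kubeconfig already pointing at the affected cluster and the conventional istio-system namespace, the two sides of the mismatch can be compared with:

    # Version each running proxy reports (1.9.5 in codfw before the image fix)
    istioctl proxy-status

    # Version regex the stats EnvoyFilters expect to match (proxy.proxyVersion)
    kubectl -n istio-system get envoyfilters.networking.istio.io -o yaml | grep 'proxyVersion'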
[12:49:33] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[12:49:39] serviceops, Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (Clement_Goubert) Open→Resolved a: Clement_Goubert
[12:50:24] serviceops, Data-Persistence (work done), SRE, Datacenter-Switchover, and 2 others: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenter to rw - https://phabricator.wikimedia.org/T330300 (Clement_Goubert) Open→Resolved
[12:50:30] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert)
[12:50:38] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert)
[13:34:17] akosiaris: just seen the msg, happy to help if needed :)
[13:37:51] akosiaris: do you want to roll it out to the staging clusters as well?
[13:40:34] sure, why not
[14:12:05] elukey: codfw done. https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[14:12:08] that was easy
[14:12:25] docs lack a mention that you need to first kube_env though
[14:18:01] ah yes
[14:19:45] istioctl proxy-status command was pretty useful. It reports 1.9.5 so it was super easy to figure out which clusters were not upgraded
[14:21:50] yep istioctl is really nice
[14:39:01] serviceops, SRE, Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Mvolz)
[14:54:31] Someone on -tech is complaining about restbase/parsoid 400
[14:57:02] wait it may be someone running mediawiki themselves
[14:57:46] yeah, it is
[15:35:41] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) The infrastructure for splitting IS.php is in place now and it has become around 4...
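A short sketch of the kube_env + istioctl proxy-status check mentioned above. The kube_env invocation is an assumption (the log only notes that the docs omit it, and the exact arguments depend on the deployment-host setup); it points kubectl/istioctl at the chosen cluster before istioctl is run:

    # Select the target cluster first (assumed invocation; adjust to local docs)
    kube_env admin codfw

    # Each proxy's xDS sync status plus the Istio version it reports;
    # anything still showing 1.9.5 has not picked up the fixed proxyv2 image
    istioctl proxy-status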
[15:54:27] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) Open→Resolved
[15:54:43] serviceops, MW-on-K8s, Platform Engineering, Scap, and 6 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Ladsgroup)
[16:14:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (Clement_Goubert) Incident doc relating the minor editing incident due to {T330300} https://wikitech.wikimedia.org/wiki/Incidents/2023-02-22_read_only
[16:27:56] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (eoghan)
[20:05:45] serviceops, Performance-Team, Patch-For-Review: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (RLazarus) Sure, we could look at adding a warmup step to the server repool process. Historically we haven't worried about it, because the impact for one host is much smaller tha...
[23:17:39] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Quiddity)