[07:23:46] serviceops, Observability-Metrics, observability: Scrape more envoy metrics in ops prometheus - https://phabricator.wikimedia.org/T317430 (fgiunchedi) +1 to add connection metrics and see what the effects are
[08:15:22] hello folks
[08:15:50] I am trying to get metrics from istio's envoy sidecar in ml-serve, and I found the following format for metrics
[08:15:53] envoy_cluster_discovery_wmnet_upstream_cx_length_ms_bucket{cluster_name="outbound|443||api-ro",le="3600000"} 305
[08:16:14] that is very far from what you do, afaics, for the tls-proxy terminators
[08:17:09] I am trying to keep our config as similar as possible to what ServiceOps has, but I may need to use a different standard for istio
[08:17:14] lemme know if you have preferences etc..
[08:18:53] "in theory" if I manage to somehow force envoy on istio sidecars to emit the above metrics without the discovery_wmnet bit it may work
[08:32:16] uhm...that's odd
[08:32:39] do you know where this comes from (the discovery_wmnet part)?
[08:38:56] I guess it is part of istio's auto-generated envoy config
[08:39:15] I wanted to dump it and check it to see if I find some clues
[08:42:26] 444064 envoy_config.txt
[08:42:30] it doesn't start well
[08:44:42] I see outbound|443||api-ro.discovery.wmnet in the config, that is trange
[08:44:45] *strange
[08:46:41] "cluster": {
[08:46:41] "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
[08:46:44] "name": "outbound|443||api-ro.discovery.wmnet",
[08:46:45] etc...
[08:47:14] jayme: do we do any kind of prometheus metric rewrite or similar?
[08:47:42] not within k8s, no
[08:47:53] s/within/for/
[08:56:56] jayme: https://github.com/envoyproxy/envoy/issues/4357 this looks related
[09:11:10] the cluster_name is probably fine, no?
[09:11:29] I was more wondering about the name of the metric containing discovery_wmnet
[09:18:38] jayme: my impression is that istio sets the cluster name as api-ro.discovery.wmnet, and the corresponding prometheus stats hit the issue
[09:18:42] because of the dots
[09:19:12] there is also https://github.com/envoyproxy/envoy/issues/4357#issuecomment-422169416 in the same gh issue
[09:23:51] yeah...but I'm wondering why we don't have the cluster name in other metric names currently
[09:24:31] I think that the cluster names defined in the tls-proxies are something like "mwapi-async"
[09:24:33] service-proxy just exports envoy_cluster_upstream_cx_length_ms_bucket{envoy_cluster_name="api-rw"...} for example
[09:24:56] yes, yes. but it's not part of the *metric* name
[09:25:02] just the label value
[09:25:24] envoy_cluster_discovery_wmnet_upstream_cx_length_ms_bucket vs envoy_cluster_upstream_cx_length_ms_bucket
[09:27:35] yes yes what I was trying to say is that maybe in the final envoy config of the tls-proxies the cluster->name field is not a hostname, but something like mwapi-async
[09:28:13] I am checking now with /config_dump on a random tls-proxy on kubernetes1012 and I don't see a cluster->name with dots
[09:29:12] there are addresses like api-rw.discovery.wmnet, but not in the "name" field
[09:29:24] like
[09:29:24] "cluster": {
[09:29:24] "@type": "type.googleapis.com/envoy.api.v2.Cluster",
[09:29:24] "name": "mwapi-async",
[09:30:46] ah, okay
[09:36:39] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[09:38:50] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[09:52:50] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[09:53:20] serviceops, Observability-Alerting, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate kubernetes alerts away from icinga - https://phabricator.wikimedia.org/T311251 (JMeybohm) Open→Resolved
[10:25:16] jayme: found https://github.com/istio/istio/pull/39162
[10:25:47] serviceops, SRE, decommission-hardware, ops-eqiad: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert) Thanks, I was able to complete the servers' powerdown through the management interface by using the asset tag FQDN. `wtp[1029-1033].eqiad.wmnet` n...
[10:26:07] but no joy, a proper fix hasn't been merged yet
[10:26:27] sweet
[10:27:38] serviceops, SRE, decommission-hardware, ops-eqiad: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Clement_Goubert)
[11:15:19] serviceops, Observability-Metrics, observability, Patch-For-Review: Scrape more envoy metrics in ops prometheus - https://phabricator.wikimedia.org/T317430 (JMeybohm) a:JMeybohm
[13:24:00] serviceops, Observability-Logging, WMF-General-or-Unknown, Performance-Team (Radar), Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (Tgr) Random ping! Would someone be interested in pushing this...
[13:34:59] serviceops, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Clement_Goubert)
[14:02:31] serviceops, SRE, decommission-hardware, ops-eqiad: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `wtp[1028-1030]` - wtp1028 (**FAIL**) - //No DNS record found for th...
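(editor's note) The naming difference discussed above, where the istio sidecar metric carries an extra discovery_wmnet segment that the tls-proxy metric lacks, can be illustrated with a small Python sketch. This is not Prometheus or Envoy configuration, only an illustration of the rename that would make the two names line up. The helper name normalize_metric_name is hypothetical, and the sketch assumes the extra segment is always the literal "discovery_wmnet_" directly after the "envoy_cluster_" prefix, as in the sample pasted in the log.

```python
import re

# Assumption: the extra segment is always the literal "discovery_wmnet_"
# right after the "envoy_cluster_" prefix, as in the sample from the log.
_EXTRA_SEGMENT = re.compile(r"^(envoy_cluster_)discovery_wmnet_")


def normalize_metric_name(name: str) -> str:
    """Hypothetical helper: drop the discovery_wmnet segment if present."""
    return _EXTRA_SEGMENT.sub(r"\1", name, count=1)


# Metric names quoted in the conversation.
istio_name = "envoy_cluster_discovery_wmnet_upstream_cx_length_ms_bucket"
tls_proxy_name = "envoy_cluster_upstream_cx_length_ms_bucket"

# After normalization the istio name matches the tls-proxy one;
# names without the extra segment are returned unchanged.
print(normalize_metric_name(istio_name) == tls_proxy_name)
print(normalize_metric_name(tls_proxy_name) == tls_proxy_name)
```

Any real rewrite would live in the Prometheus scrape configuration rather than in Python, and would need a check that the renamed series do not collide with existing ones.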
[14:03:52] serviceops, Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (Clement_Goubert)
[14:04:40] jayme: for the moment the only idea that I have is to add some prometheus metric rewrite bits for the k8s-mlserve instance, with some "docs" linking to the github issue etc..
[14:05:24] elukey: what's the ultimate reason for rewriting that? Compatibility with existing dashboards?
[14:05:41] istio provides the auto-magic prometheus + grafana deployment that offers a lot of good dashboards, not sure how they do it
[14:06:11] serviceops, SRE-OnFire, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Clement_Goubert)
[14:06:14] jayme: yes I'd like to keep envoy metrics as close as possible, but we can think about having a separate dashboard for istio sidecars as well
[14:06:31] serviceops, SRE-OnFire, Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (Clement_Goubert)
[14:07:35] elukey: if the metric name is still predictable a hack in grafana might as well work
[14:08:00] like "envoy_cluster_upstream_cx_length_ms_bucket{} or envoy_cluster_discovery_wmnet_upstream_cx_length_ms_bucket{}"
[14:08:37] jayme: I can already see people super confused and hating me for the hack to be honest :D
[14:09:08] or even a custom variable that gets set to _discovery_wmnet in case ml-clusters are selected and then "envoy_cluster${lucas_hack}_upstream_cx_length_ms_bucket{}"
[14:09:13] yeah...
[14:09:49] I'll see if I can remove the discovery_wmnet bit, not really useful.. if so in theory the ml-team could re-use the envoy-telemetry dashboard
[14:10:03] yep
[14:11:01] wo go with rewriting you would need to make sure that there are no collisions and you will probably have to add support for it in prometheus puppet (the k8s part). Not sure what we have there because discovery is done via k8s API and not static as for ops prometheus for example
[14:11:07] *to go with
[14:13:40] serviceops, Observability-Metrics, observability: Scrape more envoy metrics in ops prometheus - https://phabricator.wikimedia.org/T317430 (fgiunchedi) That added about 13k samples/s and 800k metrics, LGTM https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?orgId=1&from=1662980478669&to=16629...
[14:19:11] jayme: not afraid of collision for the moment, but the other bit is definitely a concern, sigh
[14:20:00] serviceops, Observability-Metrics, observability: Scrape more envoy metrics in ops prometheus - https://phabricator.wikimedia.org/T317430 (JMeybohm) Open→Resolved Nice. Resolving then
[15:02:27] serviceops, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, SRE, Community-Tech (CommTech-Sprint-33): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (JMcLeod_WMF)
[15:35:50] serviceops, Discovery-Search: Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (Gehel)
[15:36:00] serviceops, Discovery-Search (Current work): Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (MPhamWMF)
[18:01:47] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (thcipriani)
[18:01:57] serviceops, Phabricator, serviceops-collab, Release-Engineering-Team (Bonus Level 🕹️): Email tool maintainers about git-ssh deprecation on phabricator - https://phabricator.wikimedia.org/T313359 (thcipriani) Open→Invalid Emailing all tool maintainers about diffusion seems a little moot gi...
[18:09:19] serviceops, Phabricator, serviceops-collab, Release-Engineering-Team (Bonus Level 🕹️): Email tool maintainers about git-ssh deprecation on phabricator - https://phabricator.wikimedia.org/T313359 (Dzahn) Yea, true. I think it did. the few users that were not striker repos but were still using svn...
[18:10:48] serviceops, Phabricator, serviceops-collab, Release-Engineering-Team (Bonus Level 🕹️): Email tool maintainers about git-ssh deprecation on phabricator - https://phabricator.wikimedia.org/T313359 (Dzahn) There is still T308061 about subversion repos, fwiw.
[19:19:42] serviceops, Data Engineering Planning, SRE, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (gmodena) Hi - what is the status of the linked CR? >>! In T303543#7768019, @gerritbot wrote: > Chang...
[20:47:35] serviceops, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1): Increase of ~50 million access logs per day from mobileapps-production-tls-proxy - https://phabricator.wikimedia.org/T313099 (lmata)
[21:02:17] serviceops, Phabricator, serviceops-collab, Release-Engineering-Team (Bonus Level 🕹️): Email tool maintainers about git-ssh deprecation on phabricator - https://phabricator.wikimedia.org/T313359 (Dzahn) @thcipriani I think it also means I can just shutdown git-ssh and it should not affect anyone....
[21:06:31] serviceops, Phabricator, serviceops-collab, Release-Engineering-Team (Bonus Level 🕹️): Email tool maintainers about git-ssh deprecation on phabricator - https://phabricator.wikimedia.org/T313359 (thcipriani) >>! In T313359#8230507, @Dzahn wrote: > @thcipriani I think it also means I can just shut...
[22:03:55] serviceops, SRE: mediawiki::api: net.ipv4.local_port_range sysctl config does not exist - https://phabricator.wikimedia.org/T317454 (Dzahn) thanks @paladox confirmed. it's `ip_local_port_range` under `/ipv4/`. https://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec70.html
[22:14:49] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn)
[22:15:06] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn)
[22:15:35] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn) also T313359#8230507
[22:17:02] serviceops, Phabricator, serviceops-collab, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn) striker repos have been migrated:) (Thanks @bd808!) Hmm. now..I still notice th...