[05:36:27] serviceops, SRE-OnFire, Traffic, conftool, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (Joe) Open→Resolved AIUI this is now resolved
[06:03:26] serviceops, envoy: Puppet doesn't self-recover on build-envoy-config failure - https://phabricator.wikimedia.org/T346129 (Joe) Yes there were very good reasons not to use puppet's concat - it doesn't allow proper merging of complex data structures and thus is more liable to generate issues; also position...
[08:10:56] serviceops, Release-Engineering-Team, Scap: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 (jnuche) Open→Resolved a:jnuche While working on the solution, I reproduced the bug and tested the fix locally as thoroughly as po...
[08:21:12] serviceops, Release-Engineering-Team, Scap: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 (akosiaris) We'll schedule a scap deploy for RESTBase, thanks @jnuche
[09:01:48] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[09:37:10] serviceops, envoy: Puppet doesn't self-recover on build-envoy-config failure - https://phabricator.wikimedia.org/T346129 (fgiunchedi) I've looked into the puppet logs from the first puppet run on `cumin1001:/var/log/spicerack/sre/hosts/reimage/202309130825_filippo_2981305_titan1001.out` and the initial f...
[10:02:11] serviceops, envoy, Patch-For-Review: Puppet doesn't self-recover on build-envoy-config failure - https://phabricator.wikimedia.org/T346129 (fgiunchedi) Open→Resolved a:fgiunchedi I'll optimistically call this specific issue resolved, the nail in the coffin will be file-based xds for envoy
[10:17:47] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm)
[10:20:48] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm)
[10:46:51] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[11:03:37] serviceops, MW-on-K8s: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert)
[11:04:28] serviceops, Patch-For-Review: Remove tls-proxy cpu limits on eventstreams - https://phabricator.wikimedia.org/T345243 (Clement_Goubert) Open→In progress
[11:06:04] serviceops, MW-on-K8s: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert)
[11:06:22] serviceops, MW-on-K8s: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert)
[11:06:33] serviceops, Patch-For-Review: Remove tls-proxy cpu limits on eventgate - https://phabricator.wikimedia.org/T345244 (Clement_Goubert) Open→In progress
[11:06:56] serviceops, Patch-For-Review: Remove tls-proxy cpu limits on eventstreams - https://phabricator.wikimedia.org/T345243 (Clement_Goubert) In progress→Open
[11:18:58] serviceops, MW-on-K8s, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-herron: Deploy StatsD exporter for Kubernetes - https://phabricator.wikimedia.org/T345970 (Clement_Goubert)
[12:19:44] serviceops, Observability-Alerting, observability: Investigate swagger-exporter failures - https://phabricator.wikimedia.org/T346893 (fgiunchedi)
[12:33:47] serviceops, Observability-Alerting, observability, Patch-For-Review: Investigate swagger-exporter failures - https://phabricator.wikimedia.org/T346893 (fgiunchedi)
[13:03:08] serviceops, MW-on-K8s: mcrouter daemonset of mw-on-k8s - https://phabricator.wikimedia.org/T346690 (jijiki)
[13:03:37] serviceops, MW-on-K8s: mcrouter daemonset of mw-on-k8s - https://phabricator.wikimedia.org/T346690 (jijiki)
[13:03:40] serviceops, SRE: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki)
[13:06:01] serviceops, MW-on-K8s: mcrouter daemonset of mw-on-k8s - https://phabricator.wikimedia.org/T346690 (jijiki)
[13:31:58] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[13:39:59] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (dcausse) a:dcausse
[13:47:16] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (Gehel)
[14:06:34] serviceops, Observability-Alerting, Patch-For-Review: Investigate swagger-exporter failures - https://phabricator.wikimedia.org/T346893 (lmata)
[14:16:03] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF) The banners were set and read: someone took the opportunity to [[ https://meta.wikimedia.org/w/inde...
[14:16:06] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[14:17:04] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (kamila)
[14:39:10] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: [WD-ORG] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE)
[14:39:41] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm) a:jijiki→JMeybohm
[14:40:31] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[15:14:04] serviceops, Content-Transform-Team, Content-Transform-Team-WIP, Parsoid, and 4 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (Jgiannelos)
[15:15:09] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: [WD-ORG] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (Lucas_Werkmeister_WMDE) Open→Resolved...
[15:18:34] serviceops, Observability-Metrics, Kubernetes: Refactor discovery of calico-felix targets in prometheus - https://phabricator.wikimedia.org/T346915 (JMeybohm) p:Triage→Low
[16:54:43] serviceops, Observability-Metrics, Prod-Kubernetes, Kubernetes: Refactor discovery of calico-felix targets in prometheus - https://phabricator.wikimedia.org/T346915 (JMeybohm)
[16:54:54] serviceops, Prod-Kubernetes, Kubernetes: Reduction of Secret-based Service Account Tokens - https://phabricator.wikimedia.org/T345892 (JMeybohm)
[16:55:20] serviceops, Prod-Kubernetes, Kubernetes: Wikikube staging clusters are out of IPv4 Pod IP's - https://phabricator.wikimedia.org/T345823 (JMeybohm)
[16:55:32] serviceops, Prod-Kubernetes, Kubernetes: Audit charts drift between staging and production - https://phabricator.wikimedia.org/T345839 (JMeybohm)
[16:56:05] serviceops, Prod-Kubernetes, Kubernetes: Allow parallel image pulls in k8s - https://phabricator.wikimedia.org/T344154 (JMeybohm)
[16:56:32] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (JMeybohm)
[16:58:00] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 (JMeybohm)
[17:00:16] serviceops, Prod-Kubernetes, Kubernetes: Use a separate key for service account token issuer - https://phabricator.wikimedia.org/T275026 (JMeybohm) Open→Resolved a:JMeybohm This has been resolved with the move to PKI in {T307943}
[17:02:16] serviceops, Prod-Kubernetes, Kubernetes: Add templates for puppet_ca consumption - https://phabricator.wikimedia.org/T260964 (JMeybohm) Open→Declined Containers should have the wmf-certificates package installed which contains the puppet ca as well.
[17:03:43] serviceops, Prod-Kubernetes, Kubernetes: Audit charts drift between staging and production - https://phabricator.wikimedia.org/T345839 (JMeybohm) Linking to {T265979} as this is somewhat similar but not identical
[17:06:42] serviceops, Prod-Kubernetes, User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (JMeybohm) Open→Resolved a:JMeybohm I'm going to resolve this one as we no longer use it
[17:38:51] serviceops, Observability-Alerting, Patch-For-Review: Investigate swagger-exporter failures - https://phabricator.wikimedia.org/T346893 (colewhite) a:colewhite
[20:10:43] serviceops, AQS2.0, Cassandra, SRE, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) > Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adop...
[20:56:29] serviceops, MW-on-K8s, MediaWiki-Configuration, User-brennen, Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T346971 (Krinkle)
[20:59:38] serviceops, MW-on-K8s: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (jijiki)
[21:02:05] serviceops, MW-on-K8s, MediaWiki-Configuration, MediaWiki-Platform-Team (Radar), and 2 others: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T346971 (Krinkle) Untagging MediaWiki-Platform-Team since #MediaWiki-Configuration does not have an o...
[22:14:30] serviceops, MW-on-K8s, MediaWiki-Configuration, MediaWiki-Platform-Team (Radar), and 2 others: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T346971 (Joe) The error condition is reached when there is no value in apc and a response can't be fe...
[23:52:57] serviceops, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (RLazarus) This sounds right to me -- thanks @elukey for getting it rolling. Early on, we had talk...