[08:55:23] hnowlan: I am trying to debug last weekend's wikifeeds/parsoid outage. Is there any way I can cURL rest-gateway directly so I can check the headers?
[09:13:05] serviceops, Foundational Technology Requests, Prod-Kubernetes, Kubernetes, Patch-For-Review: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[09:20:44] serviceops, Content-Transform-Team-WIP, RESTBase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (MSantos) a:Jgiannelos
[09:26:14] hello folks
[09:29:27] I have https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957687 lined up to stop changeprop from hitting ORES (!!!)
[09:29:50] the main side effect will be to turn off ORES' Redis cache, since we'll not hit the /precache endpoint anymore
[09:30:12] when deployed I'll alert people over wikitech-l since it will increase latency
[09:30:29] but we need to stop it before being able to switch ores to ores-legacy
[09:30:33] any concern?
[09:31:05] + 💯
[09:31:17] \o/
[09:31:48] elukey: btw, we had stopped changeprop for quite a bit of time on Saturday while debugging an issue with wikifeeds
[09:32:05] if you heard any complaints, I'd like to know
[09:32:17] ah ack didn't see it, sure!
[09:32:51] I'd like to see if anybody complains as well, hopefully not
[09:33:01] changeprop is 90% of ORES' traffic
[09:43:29] nemo-yiannis: curl -v -H 'de.wikipedia.org' https://rest-gateway.discovery.wmnet:4113/de.wikipedia.org/v1/feed/featured/2023/08/20 seems to work
[09:43:49] nemo-yiannis: (from inside the infra ofc)
[09:58:58] (changeprop deployment done)
[09:59:28] (\o/)
[10:01:50] Next Monday, if nothing explodes in the meantime, we'll point ores.wikimedia.org to ores-legacy.wikimedia.org
[10:02:26] I am not confident that the first switch will be 100% perfect, but if so after that we will be able to decom ores bare metal + redis instances
[10:48:52] thanks claime
[11:16:51] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:16:57] serviceops, MW-on-K8s: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert) In progress→Resolved p50 latency increased slightly, we may want to up the concurrency a little to see what shakes. Example mw-web eqiad {F37730389...
[11:24:18] serviceops, Observability-Metrics, SRE Observability (FY2023/2024-Q1): Identify path forward for k8s deployment of prometheus-statsd-exporter - https://phabricator.wikimedia.org/T343025 (Clement_Goubert) The prometheus-statsd-exporter container and configuration is already deployed as a side-car for...
[11:33:00] serviceops, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, Stewards-and-global-tools, and 2 others: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (Clement_Goubert)
[11:48:53] <_joe_> nemo-yiannis: to clarify, this is not the first outage induced by that pattern of requests
[11:50:17] serviceops, Observability-Metrics, SRE Observability (FY2023/2024-Q1): Identify path forward for k8s deployment of prometheus-statsd-exporter - https://phabricator.wikimedia.org/T343025 (Joe) @Clement_Goubert yes that is my plan. I planned on picking up the task next week after the switchover dust se...
[12:00:12] serviceops, Kubernetes: Reduction of Secret-based Service Account Tokens - https://phabricator.wikimedia.org/T345892 (JMeybohm)
[12:26:26] serviceops, MediaWiki-extensions-CentralAuth, Stewards-and-global-tools, WMF-JobQueue, and 2 others: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (larissagaulia)
[12:27:59] serviceops, MediaWiki-libs-ObjectCache, MediaWiki-Platform-Team (Radar), User-jijiki, Wikimedia-Performance-recommendation: Use php-hrtime monotonic clock instead of microtime for perf measure in MW - https://phabricator.wikimedia.org/T245464 (larissagaulia)
[12:32:11] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[12:43:02] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[12:48:18] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[13:03:15] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (kamila)
[13:13:44] iiuc we are (or we almost are) cergen-free in all k8s clusters, this is a really long and great work jayme :)
[13:14:30] <3 thanks!
[13:15:03] whew, gg \o/
[13:19:53] I'll merge the actual removal tomorrow because chicken...
[13:20:42] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[13:21:26] Ah yes, too chicken to do it today, so might as well do it during the switchover :D
[13:21:39] ahah
[13:22:09] more like "I want to leave the stuff running for a bit before removing the puppet code and certs"
[13:23:32] Sure sure :p
[13:24:53] serviceops, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (elukey) @RLazarus What do you think? :)
[14:01:33] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:10:48] serviceops, Machine-Learning-Team: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (elukey)
[14:23:30] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 (CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/187 deploy: backward comp...
[14:45:42] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye
[14:45:49] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[14:46:41] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:54:25] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye
[14:54:32] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[14:58:08] serviceops, Machine-Learning-Team, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (JMeybohm)
[15:29:25] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (EBernhardson)
[15:29:42] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye
[15:29:46] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye
[15:46:19] serviceops, Discovery-Search, Wikidata, Wikidata-Query-Service, wdwb-tech: Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (EBernhardson)
[15:48:06] serviceops, Observability-Metrics, Patch-For-Review: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (Krinkle) Due to limitations on per-minute percentiles in Graphite, timing values are relatively rare in MediaWiki today. The vast majority of the...
[15:48:32] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (TJones)
[15:48:52] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (EBernhardson)
[15:52:02] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye
[15:53:12] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye completed: - kubernetes1038 (**PAS...
[15:57:18] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye completed: - kubernetes1047 (**PAS...
[15:58:14] serviceops, Observability-Metrics, Patch-For-Review: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (akosiaris) The range and sizes of buckets in the histogram can be defined per metric (actually group of metrics, e.g. via a regex). We already use...
[15:59:59] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[16:01:27] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (Milimetric) a:Milimetric
[16:01:31] serviceops, Observability-Metrics, Patch-For-Review: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (akosiaris) > with overrides being configured within WMF production wiring, as opposed to provided by the software. That imho violates the separati...
[16:04:06] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 (CodeReviewBot) jnuche merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/187 deploy: backward comp...
[16:13:40] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye completed: - kubernetes1036 (**PAS...
[16:16:01] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[16:21:01] serviceops, Release-Engineering-Team, Scap: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 (jnuche) Really sorry about this issue, I have just deployed a fix to production. @hnowlan, would it be possible for you to do another deploy...
[17:23:19] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (akosiaris) a:akosiaris→None
[18:45:12] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) a:RLazarus→None
[18:45:18] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) a:RLazarus
[18:45:34] serviceops, SRE, Wikimedia-Apache-configuration: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (RLazarus) a:RLazarus→None
[18:45:46] serviceops, SRE, Wikimedia-Apache-configuration: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (RLazarus) a:RLazarus
[19:56:16] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[20:25:43] serviceops, Observability-Metrics, Patch-For-Review: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) >>! In T344751#9175304, @Krinkle wrote: > The handful of timing metrics we have (and actively make use of) vary a lot in their range. I...
[20:26:11] serviceops, Observability-Metrics, Patch-For-Review: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) Interestingly, if StatsLib creates `executeTiming_seconds_bucket` as a counter and `executeTiming_seconds` as a timer and sends them to...
[20:46:36] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (Milimetric) @hnowlan: TL;DR; do you see the `cache-control` that AQS is already setting and do we need an...
[21:08:06] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (BTullis) >>! In T342213#9176546, @Milimetric wrote: > @hnowlan: TL;DR; do you see the `cache-control` that...
[21:30:51] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[22:46:31] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm) Open→Resolved