[05:17:39] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Marostegui)
[05:45:13] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Marostegui)
[06:46:44] o/ trying to understand how the traffic to elasticsearch has changed after switch-over steps, mainly "depool all services in eqiad" (let's call it T1) and the "master switch" the next day (T2)
[06:48:39] after T1 most traffic switched from elastic@eqiad to elastic@codfw, but there were still some "read" queries sent to elastic@eqiad
[06:49:28] some seem to come from api_appserver (https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=search-https_eqiad&from=1695021737562&to=1695278702741)
[06:50:56] oh could it be POST request to the search APIs?
[07:39:43] <_joe_> dcausse: given it doesn't use discovery DNS, I guess it's something in software
[07:40:52] <_joe_> dcausse: well until yesterday, POSTs to the api were going to eqiad
[07:41:10] <_joe_> so yeah, you're right
[07:42:48] _joe_: thanks, I'll check and identify such requests, these are mainly morelike requests (very costly) and probably the reason why we got perf issues only after the master switch and not after depooling eqiad services
[07:44:18] serviceops, collaboration-services, GitLab (CI & Job Runners): Create a staging apt repository for CI-based builds of Debian packages - https://phabricator.wikimedia.org/T347004 (MoritzMuehlenhoff)
[07:45:58] <_joe_> dcausse: yep, checks out
[08:09:25] Hi! Can we re-enable traffic on zhwiki aggregate feeds?
https://phabricator.wikimedia.org/T346657#9183654
[08:09:45] *featured feeds
[08:12:52] serviceops, Patch-For-Review: Setup kubernetes20[25-53] - https://phabricator.wikimedia.org/T345709 (Joe) a:Joe
[08:18:12] nemo-yiannis: yeah, I think so. If you feel ok with it, fine by me
[08:18:15] _joe_: ^
[08:18:22] any objections? or should I do it?
[08:19:22] <_joe_> akosiaris: go on please
[08:19:27] ok
[08:19:28] <_joe_> just disable the rule
[08:19:30] <_joe_> and commit
[08:23:05] nemo-yiannis: done, I've updated the task as well.
[08:23:09] thanks
[08:24:26] serviceops, Content-Transform-Team, Content-Transform-Team-WIP, Parsoid, and 4 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (akosiaris) I've just disabled the rule. It's still present, but inactive. For other SREs having to...
[08:25:06] i will keep an eye on parsoid/wikifeeds metrics
[08:27:30] thanks!
[08:40:36] Antoine helped me to find possible POST requests to the search api and found https://gerrit.wikimedia.org/g/mediawiki/services/restbase/+/c8915660aa2025d1695db797088b932b4c1d6215/v1/related.js#60
[08:59:32] serviceops, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (elukey) >>! In T346144#9160331, @herron wrote: > +1 for trying this. Thinking out loud: > > 1)...
[09:16:22] FYI, I'll redeploy eventgate-analytics, eventgate-analytics-external and eventstreams-internal in both eqiad and codfw within the hour, to pick up a new kafka config
[09:27:40] !log redeploying eventstreams-internal in eqiad T336041
[09:30:10] !log redeploying eventstreams-internal in codfw T336041
[09:32:51] !log redeploying eventgate-analytics in eqiad T336041
[09:34:15] !log redeploying eventgate-analytics in codfw T336041
[09:36:01] !log redeploying eventgate-analytics-external in eqiad T336041
[09:38:10] !log redeploying eventgate-analytics-external in codfw T336041
[09:38:43] <_joe_> brouberol: we don't have stashbot in this channel :)
[09:38:56] all done!
[09:39:06] and helmfile will log for you in #-operations :)
[09:39:12] <_joe_> exactly yeah :)
[09:39:41] elukey suggested I kept you in the loop (over at #-analytics)
[09:41:01] appreciated! But the one headsup is good enough
[09:42:36] noted!
[09:49:55] yeah exactly, sorry I didn't specify :)
[10:05:40] <_joe_> elukey: I don't think you can be forgiven...
[10:10:24] can we reboot conf1* in the current one week window, is that already planned for by someone?
[10:43:27] <_joe_> err no
[10:43:41] <_joe_> jayme: fancy switching etcd over to codfw?
[10:43:57] <_joe_> moritzm: we might once we switch over at least read clients
[10:44:45] <_joe_> in any case, ask Janis, he's our etcd expert now
[10:46:00] let me get on my bike, such expertise warrants a brass sign on his postbox
[10:57:00] <_joe_> ahahahaha
[11:27:42] serviceops, Content-Transform-Team-WIP, RESTBase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Open→Resolved
[11:36:06] moritzm: you've said to come over this week anyways :p
[11:36:34] and no, not at all interested in doing that :D
[11:50:12] serviceops, collaboration-services, GitLab (CI & Job Runners): Create a staging apt repository for CI-based builds of Debian packages - https://phabricator.wikimedia.org/T347004 (eoghan) a:eoghan
[11:55:13] <_joe_> jayme: I was being british-polite. "Fancy doing" means "please do"
[11:55:54] sure, I do understand you - still not interested :)
[11:58:01] that involves switching etcd mirror going codfw-eqiad first AIUI. We can maybe chat about that early next week
[11:58:24] just learned (again) that I unfortunately have to prep for incident review on monday
[12:25:21] serviceops, Patch-For-Review: Setup kubernetes20[25-53] - https://phabricator.wikimedia.org/T345709 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes2028.codfw.wmnet with OS bullseye
[12:43:42] serviceops, collaboration-services, GitLab (CI & Job Runners): Create a staging apt repository for CI-based builds of Debian packages - https://phabricator.wikimedia.org/T347004 (eoghan)
[13:05:52] serviceops, Patch-For-Review: Setup kubernetes20[25-53] - https://phabricator.wikimedia.org/T345709 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes2028.codfw.wmnet with OS bullseye completed: - kubernetes2028 (**PASS**) - Downtimed on Icinga/Alert...
[13:16:30] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Jelto)
[13:57:51] serviceops, MW-on-K8s, MediaWiki-Platform-Team: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Krinkle) > Make the `$wgObjectCaches['mcrouter']['servers']` an environmental variable we can define in `values.yaml`. This item is a request for our team. Discussed with SRE...
[14:19:36] serviceops, Content-Transform-Team-WIP, Parsoid, RESTBase, and 2 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (MSantos) p:Triage→Medium
[14:33:50] serviceops, SRE, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF)
[14:33:54] serviceops, MW-on-K8s, MediaWiki-Platform-Team: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Joe) Please note that modifying the mediawiki code won't be enough. we also need to allow php-fpm to access environment variables coming from the process environment
[14:37:40] serviceops, MW-on-K8s, MediaWiki-Platform-Team: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Joe) I pressed submit before finishing my comment: that amounts to setting `clear_env = no` in php-fpm I think.
[14:40:13] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 (JMeybohm)
[14:40:18] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (JMeybohm) Open→Resolved I've added two dashboards: https://grafana.wikimedia.org/d/jVM2D3mSk/kubernetes-controller-manager https://grafana.wikimedia.org/d/UPk...
[14:40:23] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[14:45:17] serviceops, Content-Transform-Team-WIP, Parsoid, RESTBase, and 2 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (akosiaris) I've just re-enabled the filter, rejecting traffic, we are meeting issues with high latencies and decr...
[14:46:36] hi o/
[14:46:57] ml team, enterprise team, and research team have an idea of using the flink solution of the event platform for a PoC of a new stream (revertrisk stream)
[14:47:11] we know that a flink operator has been installed on wikikube. we're thinking of deploying the poc to wikikube, so ml-team can gain insights from it and evaluate if we want to add flink capabilities to our infra to provide streams for the models hosted on lift wing.
[14:47:17] what do service ops think about it?
[14:49:16] aiko: there is the dse cluster (https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#dse-k8s) which has flink-operator installed. Sounds like a good fit for this
[14:52:59] yeah, PoCs involving flink are better suited to the dse cluster where development and testing takes place. Once you got something you are happy with, you can deploy it to WikiKube or mlserve, depending on scope and context.
[14:57:54] the enterprise team is committed to have a stream that ends up published to Event Streams, just as FYI
[14:58:39] in ML we were wondering if we'd need to get a flink operator and get streams that are "ML-Related", even if the ownership is fuzzy at this point (is Event Platform owning them? Are single teams owning them? Etc..)
[15:06:15] have a stream == ? Own and produce to it? or consume it?
[15:06:15] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (Milimetric) AQS 1.0 is sending the required headers now, etag is enabled on all endpoints (not just knowle...
[15:06:15] jayme, akosiaris: thanks for the information! that's a good point. we'll look into if dse cluster fits this use case.
[15:07:34] elukey: good question. A fuzzy ownership isn't something good to start with. I'd say the first thing to do is make sure to clear that up.
[15:07:37] just to double check: Is eqiad fully depooled from traffic? Can I run schema changes that will take a half a day with replication
[15:07:47] Amir1: yes, go ahead
[15:07:50] AWESOME
[15:09:00] the "what" makes something a fit for mlserve vs wikikube is a good question. I think we kinda set some ground rules on https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#Goal
[15:09:44] MLserve's goal is less clearly stated
[15:10:04] perhaps because I wrote it and I tried to not tell the ml team what to do
[15:10:15] but rather let the ML team define that
[15:20:31] akosiaris: when are you planning to pool eqiad again. I want to avoid a big oopsie on my side
[15:21:00] Amir1: plan says Wednesday, but we might be forced to do so earlier.
[15:21:22] we'll let you know if we do
[15:21:27] please ping me if that becomes the case
[15:21:28] thanks
[16:04:56] serviceops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Identify path forward for k8s deployment of prometheus-statsd-exporter - https://phabricator.wikimedia.org/T343025 (Joe) Open→In progress a:Joe
[16:36:20] serviceops, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Hardcode the SLO time windows in Grafana dashboards generated via Grizzly - https://phabricator.wikimedia.org/T346144 (herron) >>! In T346144#9185491, @RLazarus wrote: > I think it wouldn't even need to be "make edit...
[16:47:32] serviceops, Abstract Wikipedia team, WikiLambda, function-orchestrator: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend - https://phabricator.wikimedia.org/T345289 (Jdforrester-WMF) p:Triage→Medium
[17:34:42] serviceops, AQS2.0, Cassandra, SRE, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Eevans) To summarize a meeting between @Htriedman and myself: The current API attempts to return results for one of page ID (the m...