[07:58:23] hi folks!
[07:58:33] I'd need a quick review if anybody has time https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/909954/
[08:01:06] ah I think I can also add kafka-logging100[4,5]
[08:03:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[08:09:27] elukey: there is this auto-generated kafka_brokers key in /etc/helmfile-defaults/general*.yaml - any idea why that is not used here?
[08:09:31] good morning :)
[08:10:51] jayme: morning! No idea, I was investigating some timeouts and I noticed the new brokers
[08:10:52] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Marostegui)
[08:11:35] I think that deployment-charts is not being updated at all during the move, and probably eventgate-analytics-external still uses an old format?
[08:13:21] I was under the impression that the functionality was actually added for eventgate stuff, but I might be wrong
[08:13:58] yeah probably
[08:14:01] I also see something like
[08:14:02] metadata.broker.list: kafka-logging1001.eqiad.wmnet:9093,kafka-logging1002.eqiad.wmnet:9093,kafka-logging1003.eqiad.wmnet:9093
[08:14:11] and this is wrong now, since 1001 and 1002 are down
[08:14:31] I am going to send a patch to fix this, then we can ask to Keith/Andrew to coordinate
[08:16:10] cool, thanks! Unfortunately it's a two-fold thing I guess. The eventgate config and the networkpolicy template ...
[08:16:23] because the latter does not know about the FQDNs
[08:18:11] yeah exactly
[08:25:15] I have updated the code review :)
[08:25:28] so my understanding from checking kafka brokers is that
[08:25:35] 1) kafka-logging100[12] are gone
[08:25:57] 2) 100[345] are now the new eqiad cluster
[08:26:08] 3) codfw has 100[1-5]
[08:26:19] err 200[1-5]
[08:26:27] with 200[45] not doing anything
[08:26:42] I was checking https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=eventgate&var-destination=All&from=1681093845411&to=1681192554246&viewPanel=17
[08:26:54] and it seems starting more-or-less when the eqiad brokers have changed
[08:27:58] sounds insanely plausible :p
[08:28:56] serviceops, Infrastructure-Foundations: Create a cookbook to help us depool *all* services from a datacentre - https://phabricator.wikimedia.org/T327665 (Joe) Open→Resolved
[08:31:21] so let me understand, this was another "special edition" chart not using the standard methods?
[08:31:38] because we've designed stuff around kafka to be driven by puppet
[08:31:56] so if you change the kafka brokers there, that will be reflected in the values picked up on k8s
[08:32:04] ofc, you need to do a deployment :)
[08:32:27] I am not 100% sure but it seems a special chart indeed
[08:36:38] elukey: an ottomata chart you mean? :P
[08:40:36] joe: we have to admit that he was one of the first users of the deployment-pipeline, it makes sense that there are some custom things :)
[08:41:29] elukey: sure, sure, but why can't I have some fun at andrew's expense?
[08:41:30] :P
[08:42:27] :D fair :D
[08:50:35] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): 25 April 2023 Service Switchback checklist - https://phabricator.wikimedia.org/T335015 (Clement_Goubert)
[08:51:25] serviceops, SRE, Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (Clement_Goubert)
[08:51:43] we should definitely open a task to have this changed to "the standard"
[08:52:04] +1 I updated all the tasks (observability and de ones)
[08:52:32] but I'd like to merge my change if possible, so that we restore a good status
[08:52:43] I am worried that the service is in half broken state
[08:52:46] yes, yes. double checking now
[08:53:39] <3
[08:56:25] looks good to me
[08:57:09] ack thanks! Deploying then
[09:02:04] deployed, I don't see the timeout issues anymore :)
[09:02:47] nice
[12:29:51] serviceops, MW-on-K8s, Push-Notification-Service, Patch-For-Review: Migrate push-notifications to mw-api-int - https://phabricator.wikimedia.org/T334061 (Clement_Goubert) >>! In T334061#8786565, @Jgiannelos wrote: > We can try the following: > * Deploy the change only on staging k8s I've changed...
[13:59:31] serviceops, Prod-Kubernetes, Kubernetes: Make more use of Calico network policy features - https://phabricator.wikimedia.org/T331894 (JMeybohm) {T334510} is an incarnation of an issue which could probably have been circumvented by this as well
[14:57:52] well well fun at my expense while I was sleeping. ʕ •́؈•̀)
[14:58:00] serviceops, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:58:14] i also thought we already fixed the eventgate chart to use the common template...
[14:58:28] broker hostnames are needed for configs though, that has never been solved
[14:58:48] serviceops, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:58:50] https://phabricator.wikimedia.org/T253058
[14:59:59] serviceops, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[15:30:15] serviceops, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[17:09:03] serviceops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:09:41] serviceops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:49:46] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF) Open→In progress p:Triage→High
[17:49:55] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Trizek-WMF)
[18:01:24] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:01:42] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:03:19] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:16:35] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:25:32] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:27:58] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)
[18:29:37] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF) a:Trizek-WMF→sgrabarczuk I did whatever I can that doesn't require checking....
[18:33:01] serviceops, SRE, CommRel-Specialists-Support (Apr-Jun-2023), Datacenter-Switchover, User-notice: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (Trizek-WMF)