[08:16:58] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10ayounsi)
[08:17:07] hnowlan: o/ lemme know if you have time today to help/assist in the changeprop deployment :)
[08:17:27] (for serviceops: let us know if today is fine or not)
[08:23:19] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (10JMeybohm)
[08:26:09] elukey: I don't see why not
[08:40:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (10JMeybohm)
[08:41:23] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[08:41:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update staging-eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T327664 (10JMeybohm) 05Open→03Resolved All looked good today. Updated deployments and switched staging back to eqiad.
[08:52:16] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import coredns 1.8.x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T321159 (10JMeybohm) 05Open→03Resolved New coredns image and chart are running fine in staging, I think we can resolve this. I've added archiving the old rep...
[08:52:23] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[08:53:07] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[09:16:06] o/ IIUC the k8s staging cluster might switch DC from time to time, we have an app that may require different settings depending on the DC it's running on, would you have objections if we made our helmchart have distinct values for these, e.g. values-staging-eqiad.yaml and values-staging-codfw.yaml instead of a common values-staging.yaml?
[09:18:17] dcausse: I think the latter is better, we should do it for other things as well (changeprop etc..)
[09:18:46] but let's wait for others before proceeding, not sure if there are problems with this approach
[09:19:53] sure! I'm just wondering and it's staging after all so no big deal at all
[09:20:42] I had a similar issue when testing changeprop in staging, both clusters pull from kafka main eqiad and testing wasn't easy (codfw pods consuming from eqiad etc..)
[09:21:05] yes, almost same problem here
[09:21:16] 10serviceops, 10Infrastructure-Foundations, 10netops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi)
[09:35:15] elukey: today is a bit busy for me - I have an appointment in a little while which might run long, but I will let you know when I'm back
[09:39:03] dcausse: elukey: the problem with this is that we (in helmfile) do not differentiate between the two staging clusters currently. E.g. adding DC-specific values files (like values-staging-codfw.yaml) would have no effect
[09:40:35] to get this working one would have to specify both staging clusters as environments in the helmfile. That would ofc mean you'd have to always define exactly which cluster/DC you want to deploy to
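A rough sketch of what "specify both staging clusters as environments in the helmfile" could look like; the file names and layout are illustrative, not the actual operations/deployment-charts setup. Each staging cluster becomes its own helmfile environment with its own values file, at the cost of deployers having to name the target explicitly, e.g. "helmfile -e staging-eqiad apply":

environments:
  staging-eqiad:
    values:
      - values-staging.yaml         # shared staging defaults
      - values-staging-eqiad.yaml   # eqiad-specific overrides
  staging-codfw:
    values:
      - values-staging.yaml
      - values-staging-codfw.yaml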
[09:40:43] hnowlan: sure thanks!
[09:41:16] jayme: yep yep understood, but it seems ok to me
[09:41:33] not for me :-p
[09:42:11] as it makes the services different from all others. E.g. everything can be deployed to the "staging" environment, while some things can't
[09:43:01] there might be a smart way out of this, though...by checking somehow (in helmfile) what the "active" staging cluster is
[09:43:40] and then selecting the appropriate values file
[09:44:43] yeah if we do it we should change all helmfiles, not only some
[09:47:51] in that case a deployer would still need to know what the active staging cluster is and deploy there
[09:48:21] that's what I would like to avoid (having to know that)
[09:50:31] or to deploy to both and check the active one somehow (we do deploy to both dcs in prod)
[09:50:38] but I agree that it may be misleading
[10:03:43] jayme: ok understood, we can keep a single staging and adapt the config when staging migrates (hopefully it's not something that happens regularly)
[10:04:20] You can make both configs and symlink
[10:04:22] I would like to keep the staging-codfw cluster out of the default deployment path as we use that one mostly for experimenting with k8s stuff
[10:04:51] Easier and less error-prone than changing the values manually
[10:05:03] claime: sure, good idea!
[10:05:07] jayme: makes sense
[10:05:11] dcausse: it does not happen very often indeed. It has been the second time since 2020 :)
[10:05:16] ok :)
[10:05:38] and also usually for a very short period only, like ~24h
[10:14:27] the only thing that I'd try to fix is what to do with kafka-based consumers, like changeprop
[10:15:24] if pods pull from each staging dc it may happen that you deploy code to the active eqiad staging cluster and see rules processed in codfw (because the kafka consumer elected for the partition is there)
[10:16:12] it took me a lot of time to figure out why eqiad wasn't showing any sign of rules being processed
[10:17:29] hmm, it's one of those cases where our design decision to "hide" the DC from services is incompatible with some services.
[10:17:50] granted, it's staging so the cost is probably just debugging time, right ?
[10:18:37] but maybe it's a sign we need to figure out how to abstract some of the dependent resources (e.g. the DC for the relevant kafka clusters) for these services ?
[10:18:53] not that I have a good idea to do that in mind right now.
[10:19:12] <_joe_> good point
[10:19:38] <_joe_> I have a couple vague ideas
[10:19:48] there's a few more similar things, e.g. rdb hosts for changeprop/cp-jobqueue/api-gateway
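To make the kafka/rdb problem concrete, a DC-specific staging values file could pin a consumer like changeprop to resources in its own DC; the keys and hostnames below are hypothetical, not the real chart schema:

# values-staging-eqiad.yaml (hypothetical keys and hostnames, illustration only)
main_app:
  kafka:
    brokers: kafka-main-eqiad.example:9092   # consume only from the local main-eqiad cluster
  redis:
    host: rdb-staging-eqiad.example          # DC-local rdb host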
[10:19:57] <_joe_> this circles back to us not having a "datacenter" concept in charts
[10:20:12] <_joe_> yeah, nod
[10:20:39] <_joe_> I'm sure one could make some magic in helmfile.yaml to include the right "dc-global" file
[10:20:45] indeed, having a DC abstraction could solve some of these issues
[10:20:53] <_joe_> but I'd rather not do that
[10:21:12] <_joe_> tbh there's no way around splitting the staging envs
[10:21:28] the thing is, the codfw staging env is always in a state of flux
[10:21:32] <_joe_> because else, to helmfile, staging-eqiad and staging-codfw are both "staging"
[10:21:48] and almost always depooled, unless we reimage the eqiad staging env
[10:22:02] <_joe_> yeah we could have some global variable that tells us which of those is "staging"
[10:22:11] and by depooled I just mean that the symlink that implements this points to staging-eqiad
[10:22:11] <_joe_> that's more or less all I can think of
[10:22:18] <_joe_> yeah nod
[10:22:25] <_joe_> the current status is clear to me
[10:22:54] there are a number of other tangential questions ofc
[10:23:08] does it even make sense to have a staging env for services?
[10:23:32] and by staging, I mean the pretty strict definition of a safety net
[10:23:53] not the "oh, here's an environment where anyone can do anything" aka the beta cluster
[10:24:07] and I ask that question because we have safety nets now in the production clusters too
[10:24:20] things like auto-rollback, canaries, etc
[10:25:03] I like the staging config for changeprop a lot, it allowed me to iterate multiple times (with Hugh) to find what the error was.. For example, setting logging to tracing, using test kafka topics, etc..
[10:25:07] <_joe_> yeah tbh it's useful for me sometimes
[10:25:30] same here
[10:25:39] the other related question is, can this functionality be done via the production clusters ?
[10:25:41] I think it is in general because you can more safely poke around on things
[10:25:48] <_joe_> maybe it could just be a release with 1 pod in the "eqiad"/"codfw" clusters
[10:25:52] e.g. nothing stops us from having a staging release in the production cluster
[10:25:54] <_joe_> akosiaris: jinx!
[10:26:00] :)
[10:26:01] <_joe_> great minds...
[10:26:37] the downside might be one needs to be more careful
[10:26:38] <_joe_> so how you deploy a "staging" container would change slightly
[10:26:51] <_joe_> jayme: or we write a nice wrapper on top of helmfile :D
[10:26:54] and we fully skip the idea of a separate cluster, just implement it via another helmfile release, create some exp* cluster for SREs to continue their use cases and fold the capacity of the staging clusters into production
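A sketch of the "staging release in the production cluster" idea: a second, small helmfile release deployed next to the normal one in the regular eqiad/codfw clusters. The chart and file names are illustrative, not how deployment-charts is organised today:

releases:
  - name: production
    namespace: myservice
    chart: wmf-stable/myservice
    values:
      - values.yaml
  - name: staging
    namespace: myservice
    chart: wmf-stable/myservice
    values:
      - values.yaml
      - values-staging.yaml   # e.g. 1 replica, debug logging, test kafka topics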
[10:27:01] <_joe_> how many wrappers must an SRE use
[10:27:25] that is assuming the staging use cases for services are exactly that. A safety net to figure out whether what gets deployed is correct
[10:27:35] <_joe_> this is stuff for the k8s-sig channel I think
[10:27:37] and perhaps iterate on fixing it if it isn't
[10:27:46] that's true
[10:28:22] <_joe_> akosiaris: I was also wondering - for such changes maybe we could write a document on which we could require comments
[10:28:29] <_joe_> to the sig
[10:28:32] it's an interesting conversation that needs some clear definitions, clear scope, and nicely documented use cases
[10:28:34] <_joe_> and have a process to approve those
[10:28:38] and then just decide what to do
[10:28:46] akosiaris: great minds :P
[10:52:54] maybe we could do something like this short term: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/885791
[10:54:52] the active cluster can be derived from the kubernetes_cluster_groups structure in hiera
[10:55:09] and dumped into the general yaml for example
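A sketch of that short-term idea, with a hypothetical key name: puppet derives the currently active staging cluster from kubernetes_cluster_groups in hiera and writes it into the shared general values, and helmfile could then (assuming its templating allows it) select the matching values file from that flag:

# general values fragment (hypothetical key), rendered from hiera's kubernetes_cluster_groups
active_staging_cluster: eqiad
# helmfile could then pick something like values-staging-{{ .Values.active_staging_cluster }}.yaml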
[11:30:34] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice, 10Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Ladsgroup) Next is Babel. I need to write a script to analyze...
[13:14:09] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm)
[13:15:16] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm) a:03JMeybohm
[13:29:45] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm)
[13:36:53] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm)
[13:46:38] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi)
[13:59:53] akosiaris: I was fixing the disc_desired_state script when _joe_ reminded me that we may have discussed removing it completely
[14:14:22] 10serviceops, 10RESTbase Sunsetting, 10Epic, 10Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (10daniel)
[14:19:43] 10serviceops, 10Infrastructure-Foundations: Create a cookbook to help us depool *all* services from a datacentre - https://phabricator.wikimedia.org/T327665 (10Joe) p:05Triage→03High a:03Joe I am going to work on this given we have planned outages for entire rows
[14:22:46] 10serviceops, 10RESTbase Sunsetting, 10Epic, 10Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (10daniel)
[14:24:05] claime: well, I don't think anybody actually uses it, so I'd say submit a change to remove it and see who, if anyone, complains?
[14:24:13] I was the original perp and I won't.
[14:56:54] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) As a datapoint, I pushed this change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/885814 and it showed up in the Bird config without a drop in the BGP sessi...
[15:02:20] <_joe_> claime: add https://phabricator.wikimedia.org/T327663 as a related task :P
[15:38:45] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) With the above patch, plus the following test router config: `lang=diff [edit policy-options] + policy-statement kubestage_test_out { + term stage...
[16:01:40] hey everyone, I have a restbase cache problem
[16:01:47] a template on eswiki that affects 270k articles was vandalized. PCS is stuck in the restbase cache. Is it safe to proceed with a cache refresh for these articles? If yes, do you happen to know any better way of doing this bulk cache invalidation? See https://phabricator.wikimedia.org/T271184#8577052
[16:12:08] akosiaris ^
[16:12:46] 270k articles? wow
[16:13:07] in a meeting, I'll have a look later
[16:13:16] _joe_: ^
[16:13:25] he's in a meeting too ><
[16:13:47] same meeting I am in
[16:13:52] true
[16:13:53] this is for later :-)
[16:14:31] <_joe_> yeah sorry, but I remember we needed to do it once
[16:15:00] <_joe_> now in theory when the template has been reverted, we should've also recalculated the pages for restbase/pcs
[16:19:43] not sure how harmful it is to wait for it to catch up, but we do have known issues with articles failing to refresh the cache that need manual intervention https://phabricator.wikimedia.org/T226931
[16:25:45] <_joe_> mbsantos: not *every* url
[16:26:13] <_joe_> and most importantly, if the url is still in the restbase cache, we need to purge from there
[16:26:55] <_joe_> so, can you please investigate if that's the case?
[16:27:52] I'll investigate that
[17:40:57] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) a:03Trizek-WMF
[17:52:36] 10serviceops, 10SRE, 10Continuous-Integration-Config, 10Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10hashar)
[18:01:30] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[18:02:41] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[18:02:55] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) @Clement_Goubert Has anything major changed in your process since the last time (noticeable things that...
[18:14:38] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) p:05Triage→03High
[18:20:22] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[18:30:18] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) As you gave 3 dates in the task description, can you confirm precisely **when** the wikis will be in a...