[01:01:15] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) With very aggressive rounding up to nearest minute, the enwiki run takes 42 minutes. commonswiki is 8.8 hours, wikidatawiki is ~50 minutes. I'm tagging #...
[03:28:47] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm)
[05:02:08] <wikibugs>	 10serviceops, 10SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10KartikMistry) Upgrade note: node14 has removed symlink of nodejs -> node command.
[05:16:06] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) @Legoktm this queries would go to the replicas on the slow section for MW right?
[05:16:21] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) p:05Triage→03Medium
[05:20:11] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) >>! In T307314#7997561, @Marostegui wrote: > @Legoktm this queries would go to the replicas on the slow section for MW right?  Yes. Specifically, `lang=p...
[05:24:18] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) That's great then. I have lost track of which scripts were changed to reload the config (T298485), is this one of the ones that got that implemented?
[05:45:16] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) My quick skim of that ticket is that reloading config hasn't been implemented in MW yet, unless it was in a patch not linked on that ticket.
[05:49:32] <wikibugs>	 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) Maybe we can create an specific task for that implementation so we don't hijack this one :)
[10:37:14] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 (10JMeybohm)
[10:37:16] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[10:40:35] <wikibugs>	 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer ti cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm)
[10:44:27] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[10:50:06] <wikibugs>	 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm)
[11:48:56] <wikibugs>	 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm) p:05Triage→03Low
[12:41:56] <btullis>	 Hello. We've come across an application of ours (eventgate-*) that is deployed to staging-codfw. We don't think that we need or want this application to be deployed to this cluster, so what would be the best way to delete it? `kubectl` as root, or `helm` as root, or something else? Thanks.
[12:48:07] <jayme>	 btullis: Hi o/ is it an actual problem that it is running there?
[12:49:30] <btullis>	 Only that the pods haven't been updated by our normal helmfile deployment process, so the metrics comning from these pods are skewing the graphs: https://phabricator.wikimedia.org/T294911#7998319
[12:49:31] <jayme>	 we have staging-codfw as a cluster to test kubernetes things and on rare occations we switch what is called "staging" from staging-eqiad to staging-codfw (for k8s updates for example) 
[12:52:05] <btullis>	 Ok, I'm not 100% sure that we need eventgate deployed to staging at all (https://phabricator.wikimedia.org/T294911#7998423 ) so I'll happily take advice on what's best here. It's just that outdated code is running.
[12:52:06] <jayme>	 not having read the complete history of the task: Could you just filter what you need in promql?
[12:53:01] <jayme>	 I mean, you don't *need* it deployed, as it's staging. But you probably want to be able to deploy it to staging :)
[12:53:44] <ottomata>	 we do want to deploy it to staging-eqiad
[12:53:51] <ottomata>	 but according to docs we should not have deployed it to staging-codfw
[12:54:21] <ottomata>	 if we do keep it in staging-codfw, what we should do is upgrade it there and edit our deployment docs on wikitech to also deploy there
[12:55:01] <jayme>	 we don't generally require regular deployments to staging-codfw
[12:55:47] <jayme>	 sorry, I might not have gotten the point here. Is this just about metrics being wrong because they include values from staging-codfw?
[12:56:04] <btullis>	 No, so my first instinct would just be to blat the deployment to staging-codfw and then the graphs would sort themselves out.
[12:56:41] <btullis>	 --^ I skipped a response there. Yes, the problem is just that the metrics are wrong because old code is running on staging-codfw.
[12:58:35] <jayme>	 while I can surely update the deployment in staging-codfw, I would also advice to exclude it from your queries
[12:59:09] <jayme>	 as there is no availability guarantee for staging-codfw at all
[13:00:04] <ottomata>	 the problem isn't so much that grafana is wrong, its just that the code running on staging-codfw is old.
[13:00:09] <ottomata>	 eventgate code^
[13:00:44] <ottomata>	 it would be more correct to either: A. remove eventgate from staging-codfw, or B. deploy latest eventgate to staging-codfw, and always do that when we deploy eventgate.
[13:00:59] <jayme>	 yes. That I got. But that will almost ever be the case (and was in the past)
[13:02:08] <jayme>	 as said. I'm happy to deploy the lastest version to staging-codfw, but this is very likely to happen again at some point as deploying to staging-codfw is nothing that should be done by deployers in general
[13:02:22] <jayme>	 because of the lack of availability of staging-codfw
[13:04:18] <btullis>	 OK, so in terms of the eventgate dashboard in Grafana we currently have a datasource variable. Are you suggesting that it would be preferable for us to exclude staging-codfw from this query?
[13:04:21] <btullis>	 https://usercontent.irccloud-cdn.com/file/DXGc8HoX/image.png
[13:07:17] <jayme>	 Ah, I was expecting some calculation going wrong. If it's about the selector, I would suggest to ignore it as long as staging-codfw is not announced as the active staging cluster.
[13:07:20] <ottomata>	 eventgate uses a test event produce as its readinessProbe, so that means that k8s-staging is producing events to Kafka.  values-staging.yaml helmfile only exists for eqiad, i.e. there is now values-staging-codfw.yaml.  That means that eventgate-* in staging-codfw are all producing to kafka in eqiad.
[13:08:00] <ottomata>	 s/now/no/
[13:08:24] <jayme>	 Sorry, I understand now that this is pretty confusing and potentially was not properly communicated by us (the existance of those two staging clusters I mean)
[13:10:55] <jayme>	 But it would read from kafka in eqiad as well, right? So the redinessProbe works?
[13:11:34] <ottomata>	 eventgaet doesn't really 'read' from kafka, i mean i'm sure it communicates with it via its api, but it doesn't read messages from kafka, just produces
[13:11:47] <wikibugs>	 10serviceops, 10Observability-Logging, 10WMF-General-or-Unknown, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (10kostajh) >>! In T285896#7971443, @Tgr wrote: > The patch does...
[13:13:13] <jayme>	 ok. I guess what I want to know is if there is technically a "problem" in running it this way
[13:13:21] <jayme>	 the metrics issue set apart for now
[13:13:51] <ottomata>	 oh, no technically no, just a little werid
[13:14:05] <ottomata>	 if we were going to operate eventgate in codfw staging, then we should configure it properly
[13:15:02] <ottomata>	 the eventgate-analytics instances do intentionally do cross DC kafka producing, but eventgate-main is not supposed to.  In this case it is.  But ya, technically it will operate just fine.
[13:15:03] <btullis>	 Yeah, I agree. It seems odd to have code left running (sort of orphaned) by a switch of 'staging' from codfw back to eqiad.
[13:15:50] <btullis>	 ...but the metrics issue is the glaring problem right now.
[13:19:53] <jayme>	 yeah, okay.
[13:21:01] <jayme>	 The intention for staging-codfw was to have a cluster to work on without interupting deployments (by killing staging-eqiad), but with potentially the same workload and without the need for deployers to always deploy to it (as it might be down does not behave like staging-eqiad). So we more or less hid the fact that it exists apart from short time periods in which we update staging-eqiad for example
[13:21:38] <jayme>	 let's update the deployment in staging-codfw for now, so that the confusion is gone
[13:22:16] <jayme>	 I guess this affects all eventgate instances
[13:22:26] <jayme>	 *?
[13:22:35] <btullis>	 Yes I think that's correct. All eventgate instances.
[13:32:49] <jayme>	 ok, all done
[13:35:05] <btullis>	 Many thanks jayme. Graphs fixed. https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=now-15m&to=now&viewPanel=19
[13:36:46] <ottomata>	 ty
[13:50:23] <jayme>	 cool :)
[14:39:51] <wikibugs>	 10serviceops, 10SRE, 10Znuny, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) @Dzahn Yeah, sure. Let me close this now. Thanks.
[14:40:04] <wikibugs>	 10serviceops, 10SRE, 10Znuny, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) 05Open→03Resolved
[16:10:26] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment Autopilot 🛩️): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10thcipriani)
[23:03:21] <wikibugs>	 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus)
[23:03:33] <wikibugs>	 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus) p:05Triage→03Medium
[23:19:03] <wikibugs>	 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Did we determine whether the most recent spike was legitimate user traffic or malicious/DoS?  The Abstract Wikipedia team has a proposal somewhere for rendering some fragments async, we could...
[23:21:59] <wikibugs>	 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Also, one of the Wikisources has some Lua magic that renders each score like 4 times because they're PNGs. I think if we switched to/enabled SVG rendering (T49578) we could cut that down to j...