[01:01:15] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) With very aggressive rounding up to the nearest minute, the enwiki run takes 42 minutes. commonswiki is 8.8 hours, wikidatawiki is ~50 minutes. I'm tagging #...
[03:28:47] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm)
[05:02:08] 10serviceops, 10SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10KartikMistry) Upgrade note: node14 has removed the nodejs -> node command symlink.
[05:16:06] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) @Legoktm these queries would go to the replicas on the slow section for MW, right?
[05:16:21] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) p:05Triage→03Medium
[05:20:11] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) >>! In T307314#7997561, @Marostegui wrote: > @Legoktm these queries would go to the replicas on the slow section for MW, right? Yes. Specifically, `lang=p...
[05:24:18] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) That's great then. I have lost track of which scripts were changed to reload the config (T298485); is this one of the ones that got that implemented?
[05:45:16] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages?
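The node14 upgrade note above can be sketched concretely: the package no longer ships a `nodejs` alias for the `node` binary, so anything invoking `nodejs` breaks until the symlink is restored. A minimal demonstration, using a stub `node` binary in a temp dir so it is safe to run anywhere (in production the link would point at the real binary and need root):

```shell
# Demonstrate the missing nodejs -> node alias with a stub binary in a
# throwaway directory (illustrative only, not the real node install).
bindir=$(mktemp -d)
printf '#!/bin/sh\necho node-ok\n' > "$bindir/node"
chmod +x "$bindir/node"

# Restoring the alias is a one-line symlink; on a real host this would be
# something like: ln -s /usr/bin/node /usr/local/bin/nodejs (as root).
ln -s "$bindir/node" "$bindir/nodejs"

"$bindir/nodejs"   # the legacy `nodejs` name works again
```

Services migrating to node14 can either carry such a symlink or, preferably, switch their invocations to plain `node`.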
- https://phabricator.wikimedia.org/T307314 (10Legoktm) My quick skim of that ticket suggests that config reloading hasn't been implemented in MW yet, unless it was in a patch not linked on that ticket.
[05:49:32] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Marostegui) Maybe we can create a specific task for that implementation so we don't hijack this one :)
[10:37:14] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 (10JMeybohm)
[10:37:16] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[10:40:35] 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer ti cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm)
[10:44:27] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[10:50:06] 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm)
[11:48:56] 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (10JMeybohm) p:05Triage→03Low
[12:41:56] Hello. We've come across an application of ours (eventgate-*) that is deployed to staging-codfw. We don't think we need or want this application deployed to this cluster, so what would be the best way to delete it? `kubectl` as root, or `helm` as root, or something else? Thanks.
[12:48:07] btullis: Hi o/ is it an actual problem that it is running there?
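On the "kubectl or helm" question above: since the release was installed via helm/helmfile, removing it the same way keeps helm's release bookkeeping in sync with the cluster, whereas `kubectl delete` on the raw objects would leave stale release records behind. A dry-run sketch; the kube context, namespace, and release names below are placeholders, not the real production values:

```shell
# Hypothetical removal sketch. Context, namespace, and release names are
# assumptions. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Removing via helm keeps its release records consistent with the cluster:
run helm --kube-context staging-codfw -n eventgate-main uninstall eventgate-main

# For helmfile-managed services, the equivalent would be along the lines of:
run helmfile -e staging-codfw destroy
```

(As the discussion below concludes, re-deploying the current version turned out to be the better fix here than deleting the release.)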
[12:49:30] Only that the pods haven't been updated by our normal helmfile deployment process, so the metrics coming from these pods are skewing the graphs: https://phabricator.wikimedia.org/T294911#7998319
[12:49:31] we have staging-codfw as a cluster to test kubernetes things, and on rare occasions we switch what is called "staging" from staging-eqiad to staging-codfw (for k8s updates, for example)
[12:52:05] Ok, I'm not 100% sure that we need eventgate deployed to staging at all (https://phabricator.wikimedia.org/T294911#7998423 ), so I'll happily take advice on what's best here. It's just that outdated code is running.
[12:52:06] not having read the complete history of the task: could you just filter what you need in promql?
[12:53:01] I mean, you don't *need* it deployed, as it's staging. But you probably want to be able to deploy it to staging :)
[12:53:44] we do want to deploy it to staging-eqiad
[12:53:51] but according to the docs we should not have deployed it to staging-codfw
[12:54:21] if we do keep it in staging-codfw, what we should do is upgrade it there and edit our deployment docs on wikitech to also deploy there
[12:55:01] we don't generally require regular deployments to staging-codfw
[12:55:47] sorry, I might not have gotten the point here. Is this just about metrics being wrong because they include values from staging-codfw?
[12:56:04] No, so my first instinct would just be to blat the deployment to staging-codfw and then the graphs would sort themselves out.
[12:56:41] --^ I skipped a response there. Yes, the problem is just that the metrics are wrong because old code is running on staging-codfw.
[12:58:35] while I can surely update the deployment in staging-codfw, I would also advise excluding it from your queries
[12:59:09] as there is no availability guarantee for staging-codfw at all
[13:00:04] the problem isn't so much that grafana is wrong, it's just that the code running on staging-codfw is old.
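The "filter it in promql" suggestion above amounts to a negative label matcher on whichever label identifies the cluster. A sketch only: the metric name and the `site` label are assumptions for illustration, not the actual labels used by the eventgate dashboards:

```promql
# Hypothetical panel query excluding staging-codfw. The metric name and
# the cluster-identifying label ("site" here) are assumptions.
sum(rate(eventgate_http_requests_total{site!="staging-codfw"}[5m])) by (service)
```

Using `!~"staging-codfw|..."` instead would allow excluding several clusters with one regex matcher.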
[13:00:09] eventgate code^
[13:00:44] it would be more correct to either: A. remove eventgate from staging-codfw, or B. deploy the latest eventgate to staging-codfw, and always do that when we deploy eventgate.
[13:00:59] yes. That I got. But that will almost always be the case (and was in the past)
[13:02:08] as said, I'm happy to deploy the latest version to staging-codfw, but this is very likely to happen again at some point, as deploying to staging-codfw is not something that should be done by deployers in general
[13:02:22] because of the lack of availability of staging-codfw
[13:04:18] OK, so in terms of the eventgate dashboard in Grafana we currently have a datasource variable. Are you suggesting that it would be preferable for us to exclude staging-codfw from this query?
[13:04:21] https://usercontent.irccloud-cdn.com/file/DXGc8HoX/image.png
[13:07:17] Ah, I was expecting some calculation going wrong. If it's about the selector, I would suggest ignoring it as long as staging-codfw is not announced as the active staging cluster.
[13:07:20] eventgate uses a test event produce as its readinessProbe, so that means that k8s-staging is producing events to Kafka. The values-staging.yaml helmfile only exists for eqiad, i.e. there is no values-staging-codfw.yaml. That means that eventgate-* in staging-codfw are all producing to kafka in eqiad.
[13:08:24] Sorry, I understand now that this is pretty confusing and potentially was not properly communicated by us (the existence of those two staging clusters, I mean)
[13:10:55] But it would read from kafka in eqiad as well, right? So the readinessProbe works?
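The readinessProbe behaviour described above is ordinary Kubernetes mechanics: the kubelet polls each container's probe endpoint on a schedule, so if the probe handler does real work (here, producing a test event), every running replica does that work wherever it happens to be scheduled, including staging-codfw. A generic sketch of such a probe; the path and port are hypothetical, not taken from the actual eventgate chart:

```yaml
# Generic Kubernetes readinessProbe sketch. Endpoint and port are
# hypothetical; the point is that the handler produces a test event,
# so every replica produces to Kafka on each probe cycle.
readinessProbe:
  httpGet:
    path: /v1/_test/event   # hypothetical test-produce endpoint
    port: 8192              # hypothetical container port
  initialDelaySeconds: 5
  periodSeconds: 30
```

This is why the un-updated staging-codfw pods still showed up in the produce-rate graphs: readiness polling alone generates a steady trickle of events.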
[13:11:34] eventgate doesn't really 'read' from kafka. I mean, I'm sure it communicates with it via its api, but it doesn't read messages from kafka, just produces
[13:11:47] 10serviceops, 10Observability-Logging, 10WMF-General-or-Unknown, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (10kostajh) >>! In T285896#7971443, @Tgr wrote: > The patch does...
[13:13:13] ok. I guess what I want to know is if there is technically a "problem" in running it this way
[13:13:21] the metrics issue aside for now
[13:13:51] oh, technically no, just a little weird
[13:14:05] if we were going to operate eventgate in codfw staging, then we should configure it properly
[13:15:02] the eventgate-analytics instances do intentionally do cross-DC kafka producing, but eventgate-main is not supposed to. In this case it is. But ya, technically it will operate just fine.
[13:15:03] Yeah, I agree. It seems odd to have code left running (sort of orphaned) by a switch of 'staging' from codfw back to eqiad.
[13:15:50] ...but the metrics issue is the glaring problem right now.
[13:19:53] yeah, okay.
[13:21:01] The intention for staging-codfw was to have a cluster to work on without interrupting deployments (by killing staging-eqiad), but with potentially the same workload and without the need for deployers to always deploy to it (as it might be down or not behave like staging-eqiad). So we more or less hid the fact that it exists, apart from short periods in which we update staging-eqiad, for example
[13:21:38] let's update the deployment in staging-codfw for now, so that the confusion is gone
[13:22:16] I guess this affects all eventgate instances?
[13:22:35] Yes I think that's correct. All eventgate instances.
[13:32:49] ok, all done
[13:35:05] Many thanks jayme. Graphs fixed.
https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=now-15m&to=now&viewPanel=19
[13:36:46] ty
[13:50:23] cool :)
[14:39:51] 10serviceops, 10SRE, 10Znuny, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) @Dzahn Yeah, sure. Let me close this now. Thanks.
[14:40:04] 10serviceops, 10SRE, 10Znuny, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) 05Open→03Resolved
[16:10:26] 10serviceops, 10Infrastructure-Foundations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment Autopilot 🛩️): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10thcipriani)
[23:03:21] 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus)
[23:03:33] 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus) p:05Triage→03Medium
[23:19:03] 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Did we determine whether the most recent spike was legitimate user traffic or malicious/DoS? The Abstract Wikipedia team has a proposal somewhere for rendering some fragments async, we could...
[23:21:59] 10serviceops, 10SRE, 10Shellbox: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Also, one of the Wikisources has some Lua magic that renders each score like 4 times because they're PNGs. I think if we switched to/enabled SVG rendering (T49578) we could cut that down to j...