[08:16:43] The commit-message-validator tool has a bug which makes it always exit 0 [08:17:04] that is T360460 and the fix is to invoke it as `commit-message-validator validate` [08:17:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012994 Fix commit-message-validator being always successful [08:17:05] T360460: commit-message-validator job exits successfully even when it fails - https://phabricator.wikimedia.org/T360460 [08:17:21] so if one could review/merge that Puppet change, that will fix it for operations/puppet :-] [08:20:44] hashar: mefge [08:20:47] merged [08:20:53] \o/ [08:20:55] thanks! [08:51:21] I notice an unusual rise in tempauth errors in swift since 14:45 UTC yesterday; dunno if this relates to the DC switchover? [08:51:53] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?from=1710751904816&to=1710924644816&orgId=1&var-site=All&viewPanel=39 [08:52:30] it's not a _lot_, but normally we have 0, to have >0 over time is unusual [09:11:08] Emperor: 14:30 (start of the errors on moss-fe) coincides with swift-ro switchover [09:11:12] dunno if that helps [09:42:40] Yeah, I think something isn't quite right here :-/ [09:53:27] Emperor: it's currently pooled only in eqiad, do you think we need to repool it in codfw? did we forget something in the process of switching over to only one dc? [09:55:12] One of the drawbacks of tempauth is that it doesn't really log anything. [09:55:34] So I know we have a rise in tempauth failures, because there's a counter for that which we graph. [09:56:01] I'm wondering if there's a (WLOG) client with a duff credential or somesuch [09:57:49] hmm [10:00:16] (and if there's a nicer way of finding that than grobbling through logs looking for 401/403 and seeing if there's an over-represented IP) [10:04:53] Unrelated, but I think there's something wrong with changeprop aswell https://grafana.wikimedia.org/goto/a8wFtAJSk?orgId=1 [10:06:27] containers are getting oomkilled [10:13:45] ok they're in crashloopbackoff, deleteing the pod and letting it re-create seems like it fixes it [10:14:15] I'm going to run a rolling restart, cc akosiaris jayme [10:14:34] well a roll-recreate [10:14:58] claime: ack [10:15:17] you need a hand or something? [10:15:29] no I think I'll be all right [10:15:37] I'll try helmfile --state-values-set roll_restart=1 sync [10:15:42] If that doesn't work I'll kubectl it [10:15:56] can you check if it's doing the same in codfw? [10:16:03] (all containers in CrashLoopBackOff) [10:17:51] claime: they seem fine in codfw [10:17:56] ack [10:21:42] They're starting to crash again [10:22:02] Most of them already have 2+ restarts in less than 5 minutes [10:22:06] nothing in logs [10:24:22] it does not seem like oomk to me, it's exitcode 1 [10:24:39] jayme: the alert that tacked me on to it was oomkill [10:24:42] for changeprop thundering herd kind of stuff shouldn't be the problem, correct? [10:24:55] jayme: now it's something different [10:25:38] claime: should we rollback your changeprop change from yesterday? [10:25:44] just to rule that out? 
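The rolling restart mentioned at 10:14–10:15 follows the usual deployment-charts pattern quoted in the log (`--state-values-set roll_restart=1 sync`). A minimal sketch of that sequence, assuming the standard helmfile layout on the deployment host and the eqiad environment (both assumptions, not shown in the log), and that kubectl is already pointed at the right cluster:

```bash
# Roll-restart the changeprop release without changing its values
# (the roll_restart=1 trick quoted above), then watch pod health.
cd /srv/deployment-charts/helmfile.d/services/changeprop   # path is an assumption
helmfile -e eqiad --state-values-set roll_restart=1 sync

# Watch for pods going back into CrashLoopBackOff and their restart counts.
kubectl -n changeprop get pods -w
```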
[10:25:49] akosiaris: I guess [10:25:52] it also seems off since yesterday - so we should probably do ^ [10:25:57] Let's [10:26:35] {"name":"change-propagation","hostname":"changeprop-production-665f4c5548-2vd5v","pid":1,"level":"FATAL","err":{"message":"","name":"TypeError","stack":"TypeError: Cannot set property name of which has only a getter\n at Function.assig [10:26:50] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1012771 [10:26:54] revert commit [10:27:40] oh no, wait [10:27:43] If it doesn't work, we'll need to rollback the actual helm release [10:27:47] that won't fix it? [10:27:53] I think we want a pin to the older helm chart [10:27:55] Because it bumped a bunch of versions of other stuff [10:27:57] yeah [10:28:05] let me see if I can get the diff [10:28:09] I'd roll back the helm release thb [10:28:14] <_joe_> +1 [10:28:18] so instead a version: 0.13.x in helmfile.yaml [10:28:20] 0.13.12 [10:28:27] <_joe_> or that [10:28:28] rolling back helm [10:28:32] (in the meantime [10:28:34] ) [10:30:51] 11 Thu Dec 21 16:26:16 2023 superseded changeprop-0.13.2 Upgrade complete [10:31:04] helm -n changeprop rollback 11 [10:31:07] right? [10:31:10] yes [10:31:28] wenn we end up with Claus Conkle again after, we can pin the chart version in helmfile to the be 0.13.2 and roll out the config change [10:31:30] yes [10:31:32] *when [10:32:04] well, helm -n changeprop rollback changeprop 11 [10:32:05] ok, rollback done [10:32:12] the second changeprop being the release name [10:32:21] production is actually the release name but yes :p [10:32:30] lol, yes. [10:32:37] ok so [10:32:59] the config change would need to be done manually? [10:33:31] the messages referring to it might have all been discarded by now? [10:33:51] Also, somehow, it got upgraded from 0.13.2 to 0.13.12 without that being recorded by helm [10:34:25] Didn't fix it [10:34:54] give it a bit, this is changeprop, delayed effects is it's thing [10:35:13] akosiaris: changeprop-production-7d8fcbb5d-49lc9 3/3 Running 2 (61s ago) 3m41s [10:35:23] The pods already have 2 restarts in 3 minutes [10:35:25] but ok [10:36:35] <_joe_> it's apparently picking up? [10:36:39] the fatals are still in the logs as well [10:36:46] "message": "Error during deduplication", [10:36:46] "err_str": "ReplyError: ERR Connection timed out", [10:36:55] it doesn't even say what it is timing out against [10:36:55] <_joe_> that would be redis [10:37:53] yeah I end up getting a redis RedisClient error [10:38:12] <_joe_> so... let me understand [10:38:13] credentials or something? [10:38:20] <_joe_> we rollecd back to the right version? [10:38:24] <_joe_> or not? [10:38:42] Through helm, we rolled back to 0.13.2 [10:38:47] <_joe_> ok [10:38:52] Not 0.13.12 which was yesterday's version [10:38:54] <_joe_> now the errors are different, correct? [10:38:57] <_joe_> ah I see [10:39:10] But 0.13.12 isn't in helm history for some $deity-forsaken reason [10:39:13] <_joe_> ok, let's get to 0.13.12 with setting it explicitly in helmfile then? [10:39:20] let's [10:39:29] <_joe_> claime: you're sure it was 0.13.12? [10:39:41] I'll try and find my diff [10:40:03] netpols appear correct [10:40:06] it was .2 [10:40:10] <_joe_> ok [10:40:11] I read poorly [10:40:15] <_joe_> ok cool [10:40:17] phew [10:40:21] I got very scared [10:40:23] <_joe_> the problem now seems to be with redis [10:40:38] <_joe_> what redises is the config saying it's connecting to? 
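The rollback sequence being worked out above, with the release-name correction applied (namespace and revision number are the ones quoted in the log), would look roughly like this — a sketch, not a transcript of the exact commands run:

```bash
# List revisions of the "production" release in the changeprop namespace
# to find the last known-good chart version (changeprop-0.13.2, revision 11 above).
helm -n changeprop history production

# Roll the release back to that revision and confirm its state.
helm -n changeprop rollback production 11
helm -n changeprop status production
```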
[10:40:50] servers: [10:40:50] - rdb1011.eqiad.wmnet:6379:1 "cp-1" [10:40:50] - rdb1013.eqiad.wmnet:6379:1 "cp-2" [10:40:50] timeout: 1000 [10:41:12] that's nutcracker btw ^ [10:41:58] nothing in nutcracker logs btw [10:43:03] oh damn [10:43:07] i think we're missing a networkpolicy for the ipv6 address of 1011 [10:43:07] this is ipv6 related I think [10:43:07] <_joe_> nutcracker logs are useless [10:43:17] jayme: beat me to it [10:43:18] <_joe_> yep... [10:43:32] yeah, that's it, let's fix that [10:43:46] why did this decide to bite us today? [10:43:51] <_joe_> ok so, we need to pin the chart version for now [10:43:55] <_joe_> akosiaris: yesterday, but yes [10:43:56] all right, so we don't revert, we roll forward, with the ipv6 policy on top [10:44:01] _joe_: why? [10:44:06] (serious question) [10:44:18] <_joe_> claime: because this wasn't the problem you had before [10:44:31] I am not sure either we should rule forward [10:44:34] 1 change at a time [10:44:35] <_joe_> this problem arose because I think we added the ipv6 records lately [10:44:36] That's fair [10:44:38] <_joe_> yes [10:44:42] ok [10:44:45] yeah, I'd also say we stick to 0.13.2 [10:44:48] <_joe_> and after changeprop was last restarted [10:44:49] <_joe_> yes [10:44:51] <_joe_> and tbh [10:44:58] <_joe_> let's use the ips in nutcracker [10:45:01] That config file should not be in the chart, ik ik [10:45:05] <_joe_> instead of adding ipv6? [10:45:18] niah, it's easy to DTRT right now [10:45:22] gimme 2 mins [10:45:40] was about to ask if you're creating a patch...ok [10:46:15] <_joe_> I don't think it's necessarily the right thing, but ok :) [10:46:59] which version am I pinning again? 0.13.12 ? [10:47:05] .2 [10:47:08] 0.13.2 [10:48:04] <_joe_> are we sure there are still problems connecting to redis? [10:48:06] <_joe_> https://grafana.wikimedia.org/d/-Ay4Dd6Vz/redis-dashboard-for-prometheus-redis-exporter-1-x?orgId=1&var-namespace=&var-instance=rdb1011:16379&viewPanel=10&from=now-2d&to=now [10:48:14] <_joe_> looks like it's catching up some slack [10:48:18] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013018 [10:48:38] maybe it's falling back to 1013 after some time? [10:48:47] I expect it eventually to realize that 1013 is actually ok [10:49:09] and for nutcracker to create a long lasting IPv4 connection to rdb1011 [10:49:35] +1s please ? [10:49:43] <_joe_> done [10:49:57] same :) [10:50:05] thanks [10:50:27] <_joe_> claime: did you only deploy eqiad yesterday? [10:50:31] <_joe_> or also codfw? [10:50:39] <_joe_> because codfw is working rn [10:50:58] _joe_: noth [10:51:00] both* [10:51:08] 11 Tue Mar 19 13:44:49 2024 deployed changeprop-0.14.1 Upgrade complete [10:51:10] <_joe_> ok so I guess it wasn't the chart number [10:51:18] <_joe_> but just the issue with the redis connections [10:51:35] deploying [10:54:15] <_joe_> I see everything in crashloopbackoff rn [10:54:20] <_joe_> in eqiad [10:54:21] yeah, trying to fix it [10:54:53] Did the manual rollback break helmfile? [10:54:57] ok, a forceful deletion of all pods made it happen faster [10:55:08] claime: no, apparently [10:55:25] the only diff I saw thanks to the pin was the netpol diff [10:55:35] I got the first OOMKilled though right now [10:55:41] changeprop-production-7d8fcbb5d-k6lzf 2/3 OOMKilled 0 75s 10.67.158.228 kubernetes1062.eqiad.wmnet [10:55:42] yep, just saw two of them [10:55:59] <_joe_> so yeah, more memory is needed rn? [10:56:01] <_joe_> a ton more? 
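One way to confirm the suspicion about the missing IPv6 egress rule is to compare the redis/nutcracker backends' AAAA records with what is actually rendered in the namespace's NetworkPolicies. A sketch, assuming kubectl is pointed at the eqiad cluster:

```bash
# AAAA records recently added for the redis backends behind nutcracker.
for h in rdb1011.eqiad.wmnet rdb1013.eqiad.wmnet; do
  printf '%s AAAA: ' "$h"; dig +short AAAA "$h"
done

# Does any applied NetworkPolicy allow egress to rdb1011 over v6?
# No match here, while the v4 address is present, would explain the timeouts.
addr=$(dig +short AAAA rdb1011.eqiad.wmnet | head -1)
[ -n "$addr" ] && kubectl -n changeprop get networkpolicy -o yaml | grep -F -- "$addr"
```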
[10:56:11] yeah [10:56:17] we're not lacking in memory [10:56:22] let's give it some [10:56:29] <_joe_> and well [10:56:36] claime: you 'll do the patch? [10:56:41] <_joe_> the alternative is to move eventgate back to codfw right now [10:56:42] sure [10:56:44] or should I? [10:57:20] On it [10:57:32] It's 1500Mi rn, I say 2Gi even, and we'll go from there? [10:57:33] <_joe_> some stuff is getting processed btw [10:57:38] <_joe_> 3Gi [10:57:41] <_joe_> at least [10:57:45] ack [10:58:12] <_joe_> things *are* getting processed rn [10:59:27] <_joe_> but *very* slowly [10:59:34] <_joe_> we might need more replicas too? I dunno [10:59:40] I see some crashloopbackoffs too [10:59:42] with this [10:59:44] "level": "ERROR", [10:59:44] "message": "Exec error in changeprop", [10:59:59] Should stay in quota for container, but I'll raise it a bit for the namespace just in case [11:00:02] but the status is 404 and no clear page pattern [11:00:03] <_joe_> sigh [11:00:10] s/quota/limitrange/ [11:00:30] I think it's not killing it, just spewing out logs in fact [11:00:31] <_joe_> ok I'll say this: we should move restbase back to codfw, all of it [11:01:00] Reason: CrashLoopBackOff [11:01:00] Last State: Terminated [11:01:00] Reason: OOMKilled [11:01:03] no wait, it's memory [11:01:07] <_joe_> yeah it's memory [11:01:22] let's wait for claime's deploy then [11:02:03] <_joe_> yep [11:03:10] <_joe_> maybe we shouldd just kill the transcludes.resource_change stuff eventually [11:03:52] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013021 [11:03:59] (waiting on ci) [11:04:51] <_joe_> jayme: ^^ can you check? I'm looking at other stuff rn [11:05:05] <_joe_> so yeah if we want to move current events to be processed [11:05:06] claime: I think you can't do less than 100Mi IIRC ? [11:05:18] so that 50Mi needs a bump? [11:05:20] <_joe_> we just need to repool eventgate-main and restbase-async in codfw [11:05:24] akosiaris: those are directly copied from default values [11:05:32] ah, I misremember then [11:05:54] <_joe_> changeprop is now processing 99 objects/s [11:05:57] <_joe_> in eqiad [11:06:05] <_joe_> which isn't great but better than the 2/s before [11:06:28] <_joe_> so I hope with more memory it should be able to acutally get to process backlog [11:06:33] yeah [11:06:39] <_joe_> but I think we need to remove pressure [11:06:53] <_joe_> and repool eventgate in codfw [11:07:00] <_joe_> anyone against it rn? [11:07:09] yeah, me [11:07:20] I am not sure if we actually have an issue [11:07:33] as in... events don't get processed fast enough by changeprop [11:07:34] <_joe_> we have 6 million objects in backlog [11:07:35] so? [11:07:45] Merging memory raise [11:07:55] this is supposed to be updating RESTBase which is deprecated [11:08:05] and some other stuff of course [11:08:12] <_joe_> mostly restbase, yes [11:08:43] and we never got numbers as to how fast restbase **needs** to be updated after a page edit [11:09:08] <_joe_> it's gonna have consequences for edits via VE I think? [11:09:21] I think VE doesn't go via restbase in like all cases rn [11:09:27] <_joe_> unless VE actually uses parsoid now yeah [11:09:36] VE isn't going through restbase anymore iirc [11:09:39] <_joe_> but yes we need the memory [11:09:40] it's like PCS only ? [11:09:49] <_joe_> akosiaris: which is the page summaries, basically [11:09:55] <_joe_> and the mobile applications [11:09:56] which... meh? 
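Before settling on a number for the memory bump, the OOM kills and the namespace limits can be read straight from the API. A sketch — the pod name is the example from the listing a bit earlier in the log, and the 1500Mi/3Gi figures are the ones discussed above:

```bash
# Confirm the container really died to the kernel OOM killer rather than
# crashing on its own: the last state shows "Reason: OOMKilled" as quoted above.
kubectl -n changeprop describe pod changeprop-production-7d8fcbb5d-k6lzf \
  | grep -A4 'Last State'

# See what defaults/min/max the namespace LimitRange and ResourceQuota impose,
# so a 3Gi container limit (and the matching quota bump) lands inside the allowed range.
kubectl -n changeprop describe limitrange
kubectl -n changeprop describe resourcequota
```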
[11:10:10] <_joe_> we should still increase the memory [11:10:17] <_joe_> just so it stops crashing [11:10:25] <_joe_> which I think is a consequence of the huge backlog [11:10:45] ah, dammit - there is a typo clem [11:10:50] *claime [11:11:13] missing limitrange [11:11:15] yeah [11:11:17] claime: yes [11:11:22] on it [11:15:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013024 < passes CI locally [11:16:47] uh but fails in ci [11:16:53] PS1 fails [11:16:58] because I forgot the pods stanza [11:17:06] PS2 should be all right [11:17:08] ah [11:17:13] wait for CI anyways [11:17:42] the more we get bitten my this the more I think we should maybe drop limitranges :| [11:17:53] <_joe_> I was thinking the same [11:17:55] and only rely on quota? [11:18:00] <_joe_> or make this shit easier to change [11:18:17] V+2 thank you butler [11:18:18] we definitely needs the defaults [11:18:40] but the min/maxes per pods and container...hmmm [11:18:53] for some critical workloads, it's pointless [11:19:00] it needs some love, that's for sure [11:20:49] at least we know the KubernetesContainerOomKilled warning is useful :p [11:21:58] <_joe_> let's hope more memory is enough [11:24:45] akosiaris: you didn't deploy to codfw earlier? [11:24:53] <_joe_> no [11:25:04] never touched codfw [11:25:04] <_joe_> let's leave codfw as-is for now [11:25:25] ack [11:25:57] Deploying new memory limit [11:26:31] all pods running [11:27:04] <_joe_> processing has resumed [11:27:54] I've been doing Advanced Data Science (TM) on the swift logs [11:27:59] [yes, it's a shell pipeline] [11:28:33] Particularly focusing on 401 errors (apropos tempauth). [11:28:52] This found one external IP that's been doing a few thousand a day(!) but that was doing so before the switchover. [11:29:16] <_joe_> Emperor: maybe this isn't the right channel, being public? [11:29:17] But also, the eqiad frontends are logging a lot of 401s from hosts in 10.194.x.x [11:29:31] _joe_: I'm not going to mention said IP here, I think it's an incidental finding [11:29:39] <_joe_> oh ok :) [11:29:42] which codfw wasn't before the switchover [11:29:53] big uplift in dequeue rate and produced messages [11:29:59] <_joe_> yes [11:30:06] containers look stable [11:30:07] nice [11:30:07] <_joe_> 4k messages processed/s [11:30:25] <_joe_> in about 20 hours we'll be over the backlog [11:30:34] more replicas? [11:30:41] <_joe_> actually 1 hour [11:30:47] <_joe_> claime: it wouldn't help with this [11:31:02] Ah yeah because it's pinned thingies right? [11:31:11] can't remember how they're actually called [11:31:12] <_joe_> 1 consumer per topic yes [11:31:14] yeah [11:31:18] but ~1h is fine, isn't it? [11:31:22] <_joe_> yes [11:31:27] jayme: yeah, I reacted to the 20h [11:31:30] 1h is perfectly ok [11:31:30] <_joe_> if it keeps, up, let's see in 10 minutes [11:31:39] cool. so I can pack my computer and go see Fabio now :) [11:31:42] given it's been borked for ~20h [11:31:43] I don't think this can be a consequence of T358830 since I think that change hasn't ridden the train yet [11:31:54] <_joe_> I'm looking at sum(irate(changeprop_normal_rule_processing_count[5m])) [11:32:13] <_joe_> Emperor: uhm, what servers are those IPs for? [11:32:21] Am I right that 10.194.x.x is our k8s range? Is it possible there's some credential gone awry/mis-set (remember the two swift clusters are distinct?) 
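Answering the 10.194.x.x question is a one-liner against the Kubernetes API; the lookup that produces the thumbor line pasted a few messages below presumably looked something like this (the IP is the example given there, on the codfw cluster):

```bash
# Map a pod IP seen in the swift proxy logs back to its workload and node.
# -o wide adds the pod IP and node columns seen in the output quoted below.
kubectl get pods --all-namespaces -o wide | grep -F '10.194.134.119'
```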
[11:32:35] <_joe_> Emperor: it's very possible ofc [11:32:36] it's probably thumbor pods [11:32:37] max mem usage stable at around 2.2GB [11:32:42] you are correct btw [11:32:45] grep -F ' 401 ' /var/log/swift/proxy-access.log | cut -d ' ' -f 6 | sort | uniq -c | sort -rn -k 1,1 # what I'm looking at [11:32:56] <_joe_> Emperor: give us one ip? [11:33:04] 10.194.134.119 [11:33:07] <_joe_> akosiaris: can you look if those are thumbor pods? [11:33:18] <_joe_> while I check changeprop for a bit more [11:33:31] _joe_: changeprop looks CPU bound now [11:33:39] <_joe_> so actually 1 hour is a bit optimistic [11:33:40] it's getting throttled [11:33:48] <_joe_> claime: it's ok for now [11:33:58] <_joe_> it's working in overdrive already [11:34:00] yeah, not heavily so it's all right [11:34:03] <_joe_> keep an eye on parsoid [11:34:12] <_joe_> I'll keep an eye on purges at the edge [11:34:33] thumbor thumbor-main-5b8b5855ff-8crtv 11/11 Running 111 (30h ago) 27d 10.194.134.119 mw2350.codfw.wmnet [11:34:36] Emperor: ^ [11:34:38] so that's expected [11:35:13] akosiaris: it's surely not expected for thumbor to be getting lots of 401 from swift? Particularly it wasn't happening in codfw pre-switch [11:35:27] Oh, wait, is thumbor trying to use codfw-credentials to talk to eqiad-swift? [11:35:32] oh I meant for swift to see the IP you pasted [11:35:36] _joe_: expected bump in slow processing for parsoid [11:35:38] <_joe_> oh yes [11:35:58] <_joe_> Emperor: that might be it if thumbor uses the discovery record, which is dumb [11:36:05] Emperor: let me doublecheck that [11:36:28] <_joe_> akosiaris: if thumbor uses swift.discovery.wmnet anywhere, that's the problem [11:36:35] <_joe_> and we need to repool it *now* [11:36:45] SWIFT_HOST = 'https://swift.discovery.wmnet' [11:36:46] yes it is [11:36:51] I am repooling it [11:36:52] aha [11:37:00] Emperor: good find [11:37:07] I'm glad I kept poking that niggling "this doesn't look right" :) [11:37:13] Emperor: very good catch [11:37:34] Should we change thumbor's values-{eqiad,codfw}.yaml to point to the dc-local records? [11:37:59] swift Active/Active pooled [11:37:59] swift-ro Active/Active pooled [11:37:59] swift-rw Active/Passive pooled [11:38:00] <_joe_> claime: yes but for now let's repool swift [11:38:04] all 3 only in eqiad [11:38:14] <_joe_> yeah repool all of them :/ [11:38:14] _joe_: ack [11:39:23] ok, something needs fixing there. Glancing at the config, thumbor doesn't have the notion of "fetch from this read-only swift, put result in this r/w swift" [11:39:44] <_joe_> akosiaris: thumbor should only use the local swift, ever [11:39:46] _joe_: re: changeprop, big expected bump of rps to mw-api-int as well [11:39:47] <_joe_> it's by design [11:40:04] <_joe_> if it's not, that's gonna cause a ton of issues [11:40:05] local swift> +1 [11:40:22] (like rps x5 for mw-api-int) [11:40:22] why did it end up with a discovery record in the config then... hmmm [11:40:23] (remember that the two ms swift clusters are entirely distinct, separate credentials, the works) [11:40:29] <_joe_> akosiaris: mistake [11:40:35] <_joe_> 100% a mistake [11:40:35] because a lot of things now use the core parser and not parsoid anymore [11:40:50] <_joe_> is anyone repooling swift? [11:41:02] I assume effie is on top of it? 
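A variant of Emperor's pipeline above that keeps only in-cluster sources, to check whether the 401s are dominated by a single pod range; this assumes the client IP is field 6 of proxy-access.log, as in the original one-liner:

```bash
# Count 401s per client IP, restricted to the k8s pod range (10.194.x.x),
# most frequent first -- a quick way to spot a single misconfigured workload.
grep -F ' 401 ' /var/log/swift/proxy-access.log \
  | awk '$6 ~ /^10\.194\./ {print $6}' \
  | sort | uniq -c | sort -rn | head -20
```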
[11:41:05] I am [11:41:16] <_joe_> ah sorry I missed your message effie <3 [11:41:27] I was looking if thumbor is pooled in both dcs [11:41:40] <_joe_> effie: thumbor isn't reached via discovery [11:41:41] https://grafana.wikimedia.org/goto/FZJjxAJIz?orgId=1 < Hello events [11:41:46] _joe_: yeah [11:41:52] <_joe_> it's a local attachment to the swift cluster [11:41:54] I had to re-remember [11:42:13] <_joe_> claime: is mw doing allright? [11:42:18] peachy [11:42:20] <_joe_> mw-api-int, I mean [11:42:23] <_joe_> cool [11:42:28] claime: changeprop (the instance, not the jobqueue) needs to die... [11:42:38] akosiaris: very much agreed [11:43:06] <_joe_> so as I feared, the backlog isn't going down rn [11:43:09] _joe_: how is the edge doing? [11:43:11] <_joe_> because we're generating more links [11:43:27] <_joe_> claime: uhm we should be probably re-forwarding to your last chart version heh [11:43:48] Oh it's still there x) [11:43:53] yeah we should [11:43:53] it dropped off drastically for a while https://grafana-rw.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&from=now-1h&to=now&forceLogin&viewPanel=27 [11:43:57] patch incoming [11:44:01] I suppose it's that page again ? [11:44:05] it is [11:44:12] claime: just revert my version: pin and deploy [11:44:28] yeah [11:44:34] page 48 of the pdf done, how many times again ? [11:44:42] <_joe_> too many :D [11:45:28] 2000^2000 [11:45:33] (yeah) [11:45:38] Emperor: errors are going down now I reckon ? [11:46:42] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013029 [11:46:57] what's peculiar is that the swift.discovery.wmnet change was merged on Dec 08, 2022 [11:47:08] so, we 've been through 2 switchovers without noticing [11:47:09] ? [11:47:40] <_joe_> akosiaris: we never depooled swift before IIRC [11:47:51] <_joe_> during the switchover [11:47:53] and if end-users noticed... we never heard much from them [11:47:58] <_joe_> because that record made little sense [11:48:01] <_joe_> akosiaris: actually... [11:48:15] <_joe_> I think some issues with thumbnailing that were surfaced yesterday... [11:48:19] which is a different broken thing [11:48:29] I will mend thumbors helmfile [11:48:57] did we not? I don't remember telling Clement to handle swift as an exception 1y ago. [11:49:01] * akosiaris digging into tasks [11:49:21] <_joe_> akosiaris: the cookbook in the past had an exception for swift [11:49:24] Reverting version pin in eqiad [11:49:28] <_joe_> it was listed as one of the excluded services [11:49:34] <_joe_> claime: ack [11:50:26] effie: still non-zero, I'll give it a bit longer to settle [11:50:31] New release deployed [11:50:43] <_joe_> Emperor: non-zero but reduced? [11:52:01] 'swift', # temporary, undergoing rebalancing (T287539#7339799) [11:52:01] 'swift-ro', # per above [11:52:01] T287539: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 [11:52:16] the exclusion was temporary and was removed before last years March switchover [11:52:56] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?from=now-15m&to=now-1m&orgId=1&var-site=All&viewPanel=39 tempauth now looks good thanks effie [11:52:57] confd firing for swift-rw, see -operations [11:53:11] ok, I got to take a break actually. 
I am hungry and I need to pickup my daughter from school [11:53:59] mw-api-int will fire for saturation, but latency's all right [11:54:15] (hovering around 60%) [11:54:20] <_joe_> Emperor, effie https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013033 [11:54:52] Have another big file to exclude actually [11:54:58] Parsing File:Eaddy - English(...) - NARA - 108642302 (page 1309).jpg was slow, took 9.99 seconds [11:55:05] patch incoming [11:55:06] <_joe_> oh christ [11:55:45] ? [11:55:52] Emperor: changeprop stuff [11:55:57] ah, OK [11:57:40] <_joe_> Emperor: yeah sorry I wasn't invoking you [11:57:45] x) [11:57:51] <_joe_> but rather a generic deity who should help us right now [11:58:02] just copped another actually [11:58:08] srsly [11:58:17] <_joe_> claime: stop playing whack-a-mole [11:58:21] <_joe_> I have a proposal [11:58:38] <_joe_> a regexp for anything with (page \d\d\d+) in the title [11:58:49] Hmmm [12:00:28] will changeprop's regex engine take that? [12:00:41] <_joe_> no idea! [12:00:46] Let's find out! [12:01:01] <_joe_> try it in staging first [12:02:59] Emperor: we will leave swift in both DCs pooled for today and I am updating thumbor right now. I will attempt to depool codfw again tomorrow morning along with one more service that was naughty during the switchover [12:08:11] _joe_: Actually that'll still trigger snowballs, just smaller ones [12:08:19] (the \d\d\d+ idea) [12:15:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013035 [12:16:34] It'll be better though [12:21:18] claime: happy to +1 it, since we are experimenting anyway [12:23:38] effie: ack, thanks for the update [12:24:20] <_joe_> claime: I would wait 1 hour or so [12:25:55] Yeah there hasn't been any more these matching files for a bit [12:27:55] summary definitions rerenders isn't going down though :( [12:32:54] I think mobileapps is bottlenecking on memoty [12:33:08] https://grafana.wikimedia.org/goto/fW3uL01Iz?orgId=1 [12:33:49] <_joe_> we've halved the backlog btw [12:34:36] the total backlog? [12:34:52] because some topics aren't going down at all [12:35:06] I see a bunch of them that did resolve though so that's good [12:35:27] <_joe_> yes [12:35:32] <_joe_> the total backlog [12:35:55] <_joe_> it's down from 7.92M to 4.98M [12:36:01] <_joe_> over 1 hour [12:36:32] <_joe_> but now the decrease has slowed, because we're generating a ton of jobs [12:37:54] <_joe_> so I guess I wasn't that wrong saying 20 hours [12:39:19] swift related question [12:39:27] should swift-rw be pooled in both dcs? [12:39:32] I thought it was a/p [12:39:48] afaik it used to be, not anymore, but ask someone else [12:40:03] maybe I am confusing it with cassandra/aqs [12:40:11] <_joe_> claime: swift-rw points to what? [12:40:15] <_joe_> the main swift cluster? [12:40:58] same as the other swift dnsdics [12:41:09] <_joe_> yes it does [12:41:12] 10.2.x.27 [12:41:13] <_joe_> so it's meaningless [12:41:27] <_joe_> I don't know why we introudced it, probably historical reasons [12:41:40] Yeah there's a TODO remove from dns on it in service.yaml [12:41:59] but since it's pooled like that and is active_active: false in service.yaml, it's alerting [12:42:17] or rather because of its dns record I guess [12:42:18] good catch then as an actionable to cleanup [12:42:21] <_joe_> ok, set it to only eqiad I guess? [12:42:24] yeah [13:34:50] Do we have a phab task for (this) switchover and/or associated issues? 
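On the "will changeprop's regex engine take that?" question above: change-propagation is a Node.js service, so the proposed pattern can be sanity-checked against the offending titles with a one-liner before it goes anywhere near staging. A sketch — the first title is abridged exactly as it appears in the log, the second is a made-up negative case, and how the pattern ultimately gets escaped in the chart's YAML is a separate question:

```bash
# Quick check that the proposed exclusion pattern matches the problem titles.
node -e '
  const re = /\(page \d\d\d+\)/;
  const titles = [
    "File:Eaddy - English(...) - NARA - 108642302 (page 1309).jpg",  // from the log
    "File:Some ordinary upload.jpg",                                  // hypothetical non-match
  ];
  for (const t of titles) console.log(re.test(t), t);
'
```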
[13:35:51] Emperor: https://phabricator.wikimedia.org/T357547 [13:42:24] TY :) [13:45:03] fyi, switchover in 15m, please don't run cookbooks for the next 30-45m to avoid interfering with the switchover. [13:45:26] good luck team! [13:46:05] akosiaris: I suggest to cross post this in the two private channels and dcops one [13:47:32] good point, thanks [13:48:36] * volans learned it from past experience ;) [13:50:16] !oncall-now [13:50:16] Oncall now for team SRE, rotation business_hours: [13:50:17] j.ayme, a.kosiaris, h.erron, j.hathaway [13:50:44] Dear SREs, we have started the switchover preliminary work [13:51:14] good luck [13:52:34] break a leg [13:52:39] is it expected/known that cumin1002 has a root tmux with a dry-run switchdc cookbook still running? [13:53:07] akosiaris: where is the switchover coordination going to happen? here or in -operations? [13:53:50] taavi: could be teh test one, we can ignore [13:54:07] marostegui: I would suggest we post anything here if we have to, unless anyone objects [13:54:17] ok [13:55:27] ok. Overall, if you see anyone posting things into other channels and those things are related to the switchover, ask them to repeat here. [14:00:55] mwmaint2002:~$ systemctl list-units 'mediawiki_job_*' says failed [14:06:04] ● mediawiki_job_growthexperiments-listTaskCounts.service loaded failed failed MediaWiki periodic job growthexperiments-listTaskCounts [14:06:07] etc etc [14:06:19] reset-failed, stop [14:06:22] imo [14:06:58] the stop thing has happened fine [14:07:00] doesn't seem like a big deal, as long as it is not running [14:07:03] <_joe_> I think the problem is a shell command launched by bvibber [14:07:14] <_joe_> which is running a script [14:08:00] looks unrelated ? [14:08:15] we 'll have to kill it though I think [14:08:26] afaict those units are all `Main process exited, code=killed, status=15/TERM`, `Failed with result 'signal'.` [14:08:46] which is fine [14:08:47] so they're just sad because we killed them, seems fine to clear and ignore [14:08:48] yeah [14:08:52] I think we do need to kill the script yeah [14:08:53] <_joe_> akosiaris: it needs to be killed [14:08:59] ok, let me reset-failed all these then [14:09:00] I am killking it [14:09:06] cool, thanks [14:09:35] 0 loaded units listed. Pass --all to see loaded but inactive units, too. [14:09:47] so, reset-failed worked fine apparently we are good to proceed [14:10:09] (and next switchover maybe we won't be using systemd units so that won't be an issue 🤞) [14:10:46] 😱 [14:10:53] lgtm to continue [14:10:55] Need to rerun the stopmaintenance imo [14:11:13] yep [14:11:48] <_joe_> it won't work even if we re-run it, let me ping brooke [14:12:16] killing pid 4551 (bash rerun13.sh) should get rid of it I think [14:12:19] _joe_: why wouldn't it? the originating bash script has been killed, what's left is mwscript? 
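The cleanup discussed above boils down to two things on the maintenance host: clearing the failed (deliberately TERM'd) job units, and finding the stray manually-launched wrapper before killing it. Roughly:

```bash
# Units killed by stop-maintenance end up "failed" with result 'signal';
# list them, then clear the failed state so the check goes green again.
systemctl list-units 'mediawiki_job_*' --state=failed
systemctl reset-failed 'mediawiki_job_*'

# Find the manually launched wrapper that stop-maintenance did not own
# (rerun13.sh / pid 4551 in the log) so it can be killed explicitly.
pgrep -af rerun13.sh
```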
[14:12:34] <_joe_> oh it was now I think [14:12:39] which should get killed by stop-maintenance [14:12:51] oh I thought e.ffie was killing it already 👍 [14:13:00] <_joe_> taavi: yeah I wanted to let brooke know before it was killed [14:13:23] I killed it [14:14:22] yep, seems gone to me [14:14:32] final call for GO/NOGO [14:14:33] PASS now [14:14:40] go [14:14:42] rust [14:14:44] wait, wrong joke [14:14:57] lol tx [14:15:09] go [14:15:16] effie: I am ready [14:15:50] * claime listens to the silence [14:15:51] quiet from hatnote [14:16:29] * brouberol plays "Sounds of Silence" [14:16:39] Hello darkness my old friend… [14:17:54] looking good [14:18:14] eqiad writtable [14:18:29] hatnote played an edit again [14:18:29] ping hatnote [14:18:29] <_joe_> sounds [14:18:31] yay [14:18:34] <_joe_> yeah [14:18:36] I can write in eswiki [14:18:40] I just did a test edit on es too [14:18:53] labswiki works too [14:19:21] testwiki also editable with visualeditor [14:19:41] back to the usual edit rate more or less [14:19:41] And commons too [14:19:58] <_joe_> effie: please proceed [14:20:10] <_joe_> we need to get past the jobrunners stage before we're out of impact [14:20:16] POST 5xxs peaked in eqiad, dropping again [14:21:05] * Emperor has no further swift-related spanners to lob in the works [14:21:05] rzl: where? [14:21:07] bare metal? [14:21:09] <_joe_> rzl: post to what? [14:21:19] <_joe_> please be a bit more specific :) [14:21:25] still digging, that was just on https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=eqiad&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=1710943577654&to=1710944477654 [14:21:39] _joe_: I have a networki hiccup [14:21:48] but alex pushed the button for the next stage [14:22:25] elastic@eqiad might scream a bit, seeing a small latency spike (hopefully short) [14:22:39] All DB masters looking fine [14:22:50] <_joe_> we should see the jobrunners pick up the traffic in eqiad [14:22:57] <_joe_> and go down in codfw [14:23:19] <_joe_> confirmed it's happening [14:23:26] definitely picking up in eqiad [14:24:34] <_joe_> yeah error rate recovered [14:25:57] <_joe_> mw-api-int in eqiad is running very hot [14:26:02] <_joe_> but latency is still ok [14:26:19] That's the jobs restarting I bet [14:26:19] <_joe_> utilization is at 65-70% though [14:26:37] <_joe_> claime: it's the result of changeprop's outage from the last day mostly [14:26:42] shhhh [14:26:45] :p [14:26:48] <_joe_> the moving of all rw traffic just got over the edge a bit [14:27:04] it should run fine at 70/75% [14:27:14] as long as latency doesn't go bad we're fine [14:27:18] <_joe_> yeah... [14:28:17] I can throw a few more replicas at it if need be [14:28:29] Is the deployment server going to be switched today too? [14:29:16] <_joe_> claime: not right now, I don't think we need to [14:29:22] _joe_: No I don't think either [14:29:24] those exceptions were all `Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: You can't edit now. This is because of maintenance. 
Copy and save your text and try again in a few minutes.` from mediawiki in eqiad, both metal and k8s, in the couple of minutes after we went RW https://logstash.wikimedia.org/goto/d4a690118115b6abbb551ab36672f1c5 [14:29:27] Just stating we have the option [14:29:44] filtered out kube-mw-jobrunner in that link because it dominates the numbers but they were affected too [14:29:48] <_joe_> rzl: which is probably stuff that was running before the change propagated [14:29:51] yeah [14:30:03] <_joe_> rzl: I imagine mw-jobrunner mostly in codfw though? [14:30:15] yep [14:32:15] 2m 41s of read-only per cookbook output. [14:33:13] to _joe_'s point maybe we should start tracking the maintenance downtime too [14:33:21] marostegui: tomorrow is our aim for deployment server [14:33:27] maintenance and/or jobrunners downtime [14:33:27] good thanks [14:33:58] <_joe_> rzl: as of now the jobrunners downtime is not user-facing, that's why we never stressed it [14:34:06] <_joe_> now, maybe this will change in the future [14:35:09] mm [14:35:35] I mostly just mean if we've decided the current range of RO time is about right, we can move on to shooting for a high score on something else :D [14:37:35] <_joe_> rzl: I am starting to think we should automate all steps in the ro-phase into one, with the ability to abort [14:37:53] <_joe_> the steps are pretty set in stone by now and relatively straightforward [14:38:16] yeah, I remember we talked about that a couple years ago but now maybe it's a good time [14:38:36] I think we decided there wasn't much upside except for saving a few seconds, but maybe there isn't much downside either [14:39:01] and it means if the operator loses internet during the critical period, the switchover succeeds by default instead of failing by default [14:40:02] yep, we talked about it after my run iirc [14:40:25] <_joe_> ok, I need to caffeinate before my meetings [14:41:43] <_joe_> good job everyone [14:41:56] <_joe_> claime, jayme any idea what to do with changeprop? [14:42:09] yeah gg all :) [14:42:35] _joe_: backlog picked up a bit, going down but I expect it will stabilise at the same size as before the switchover [14:42:39] _joe_: not really...but it smells a bit like restbase is failing for quite some stuff https://logstash.wikimedia.org/goto/6bbaf27628889a292e343499c22c83ea [14:43:14] <_joe_> we'll see how it goes, I would say we let it be for a few hours then reconvene? [14:43:20] not sure if that's solely related to all the old stuff in the queues [14:43:32] I would try raising concurrency for the jobs with the highest backlogs if it doesn't go down below "before switchover levels" [14:44:36] purges have started going down though [14:44:57] <_joe_> jayme: please tell urandom about the cassandra stuff... [14:45:26] apart from the purges it's all summary rerenders, mobile sections [14:45:28] I think you just did :D [14:45:30] which I think is all PCS [14:45:39] :) [14:45:45] * urandom reads backscroll [14:45:48] So we could in theory bump PCS resources a bit and that could maybe help? [14:48:26] at least it seems the backlog is no longer increasing [14:48:55] I don't know grafana is unhappy with me at the moment [14:49:47] it's been unhappy with me all day long today [14:50:08] well..not really. mw-purge is descreasing now but summary_definition is still increasing [14:50:38] mobile_sections slowly decreasing now as well [14:51:08] Thing is I'm not seeing obvious signs of saturation on PCS [14:51:17] So maaaaaybe increase concurrency on these queues? 
idk [14:51:34] that'll increase pressure on mw-api-int tho [14:51:37] jayme: those (changeprop) errors are pretty opaque [14:52:02] well, no shit :D [14:52:59] I thought maybe that's what restbase returned - or does changeprop talk to cassandra directly? [14:53:34] <_joe_> no it talks to restbase [14:53:40] <_joe_> those are errors returned by restbase [14:53:48] so I thought [14:56:14] I restarted my maint scripts in mwmaint1002 (with Alex and Effie's approval) [14:57:20] I made some numbers, the official :-) read only time was 3 minutes, and 8 seconds; but that's a worse case scenario, in wikidata, the actual maximum time between 2 edits was 2 minutes, and 39 seconds [14:57:21] Lucas_WMDE: see above: [15:33:21] marostegui: tomorrow is our aim for deployment server [14:58:12] ack, sorry, didn’t read all the backscroll yet [14:59:19] Switchover done, thank y'all for being so supportive! [14:59:36] great job effie and the team [14:59:46] gg effie \o/ [14:59:58] nice job team! [15:00:12] nicely done, and once again thanks for making this super easy to follow along :) [15:00:12] and effie for leading it [15:00:36] I think the changeprop errors correspond to these in restbase(?) https://logstash.wikimedia.org/goto/b9463398cbebd94003b0f978349e5b09 [15:01:20] ResponseError: Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required) <-- that is very weird [15:01:33] effie: well done, I think it's the first time I see a switchover without elastic or wdqs alerting in some way or another :) [15:01:43] +1 [15:02:15] :D [15:02:16] since replica count (per datacenter) is 3, LOCAL_QUORUM is 2...so 'over 3 required' ? [15:02:35] congrats effie! [15:02:55] ( ^_^)o自自o(^_^ ) CHEERS! [15:04:18] urandom: how many servers are running in that clusters? If 5, then LOCAL_QUORUM would be 3, and if only 2 servers are UP/in UN state, then the read query does not get the required consistency level and fails [15:04:26] at least that's how I remember it [15:05:33] brouberol: there are way more than 5 nodes, but the quorum is for the number of replicas [15:07:04] right, and I take it we have RF=3 for this table? [15:07:14] yes [15:07:53] that _is_ weird then. LOCAL_QUORUM should be 2 [15:16:26] congrats 👏 [15:35:13] Ok, there is a decommission running, which temporarily has the effect of increasing the replica count (to preserve atomicity), so that explains how the (otherwise bizarre) error is *possible*... but even if the temporary quorum is 3, I don't know why that would be failing. [15:35:34] everything seems up and healthy [15:43:46] wait, no, that can't be write... the adjusted replica set should only be for writes, no way that would make sense for reads, which means it is in fact bizarre [15:43:57] haha, can't be "write"... whew [15:59:49] sukhe: let me check the logs what we were running at that time [16:02:18] 2024-03-20 14:16:56,264 jiji 2113574 [INFO _log.py:114 in log_task_start] START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [16:04:09] effie: I do see the error, just trying to figure why it is failing [16:04:39] it's just _var_lib_gdnsd_discovery-swift-rw.state.toml [16:05:35] sukhe: it may (or may not) be related with us repooling swift* on codfw [16:05:41] this morning [16:06:44] effie: repooling on codfw? [16:06:48] sukhe@cumin2002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 get /conftool/v1/discovery/swift-rw/codfw [16:06:51] {"pooled": false, "references": [], "ttl": 300} [16:07:00] this is not expected then? 
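The etcd read above generalises to both sides, which is a quick way to see what the discovery record is meant to look like before blaming gdnsd; the eqiad key mirrors the codfw one quoted and is assumed here:

```bash
# Read the conftool discovery state for swift-rw in both DCs
# (same etcd endpoint and key layout as the command quoted above).
for dc in eqiad codfw; do
  printf '%s: ' "$dc"
  etcdctl -C https://conf1007.eqiad.wmnet:4001 get "/conftool/v1/discovery/swift-rw/$dc"
done
```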
[16:07:42] swift-rw is typically a/p, AIUI [16:08:03] - dnsdisc: swift-rw # TODO: remove this from DNS! [16:08:03] active_active: false [16:08:03] - dnsdisc: swift-ro # TODO: remove this from DNS! [16:08:03] active_active: true [16:08:20] though I think neither swift-rw nor swift-ro are actually used for anything [16:08:24] Emperor: so I did confuse it with aqs indeed [16:08:38] I depooled swift-rw from codfw earlier today btw [16:08:41] through conftool [16:08:50] claime: yeah that adds up [16:08:59] my notes say they were intended for some new development that never happened [16:09:26] sigh, so we have two things here causing us confusion [16:09:41] welcome to swift ;-) [16:09:44] haha [16:10:28] When it's only two things causing confusion with swift, it's a good day [16:10:37] 😿 [16:13:20] effie: my suspicion is that all is good on your end, it's just confd misbehaving, if I at least look at /var/run/confd-template for example [16:14:37] sukhe@dns1004:/var/lib/gdnsd$ cat discovery-swift-rw.state [16:14:37] 10.2.1.27 => DOWN/300 [16:14:37] 10.2.2.27 => UP/300 [16:16:39] ^ this looks fine so there's that [16:16:46] ok, I am just going to clean up the state and see [16:20:15] Ok, so the Cassandra decommission has been stopped, and the changeprop errors have subsided. The quorum issue now makes sense in the context of blocking read repairs (write triggered by a read), and this obscure bug: https://issues.apache.org/jira/browse/CASSANDRA-19120. [16:21:15] I guess this was happening before, but we just didn't notice because the decommissions were happening in eqiad before the switch over [16:23:21] effie: all good, cleaning up the state helped. no actionable for you, sorry for the noise :) [16:24:08] the question of why this happens still remains so I will file a task [16:31:00] <_joe_> sukhe: it's a known limitation [16:31:12] <_joe_> I think there is a note about cleaning up those files [16:31:17] ah ok, let me check [16:31:21] <_joe_> in the switchover procedure [16:31:33] <_joe_> somewhere, or it used to be there [16:31:40] for now, I cleaned them up [16:31:52] <_joe_> thanks <3 [16:32:24] the cookbook does delete them [16:32:31] or at least used to [16:33:10] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/09-restore-ttl.py#25 [16:33:55] that matches it, yes [16:34:11] but swift is not in the list of MEDIAWIKI_SERVICES [16:34:16] weird though why it didn't work [16:34:17] ah [16:34:30] in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/__init__.py#17 [16:36:07] I am guessing we should swift-rw there in the restore-ttl cookbook [17:38:23] <_joe_> no [17:38:40] <_joe_> it was just a previous change done by hand to fix an outage [17:38:46] <_joe_> sorry, I wans't reading [17:38:50] <_joe_> and about to go offline [17:39:10] np and ok! [21:27:58] <_joe_> jhathaway, urandom, kamila_ let's move here [21:28:03] <_joe_> so one thing that matches is [21:28:05] <_joe_> https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&var-dc=thanos&var-site=eqiad&var-service=wikifeeds&var-prometheus=k8s&var-container_name=All&refresh=30s&from=now-30d&to=now&viewPanel=27 [21:28:17] <_joe_> network rx bytes increasead around that time [21:28:53] hmm [21:30:00] transmit looks pretty flat, so perhaps bogus inbound traffic? [21:30:17] what talks to wikifeeds? 
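The "why didn't the cookbook clean this up" thread above comes down to which dnsdisc names are in MEDIAWIKI_SERVICES; that is quick to confirm from a checkout of operations/cookbooks (the paths are the ones linked in the log):

```bash
# In a checkout of operations/cookbooks: see which discovery records the
# switchdc cookbooks touch, and whether swift-rw is among them.
git grep -n 'MEDIAWIKI_SERVICES' cookbooks/sre/switchdc/mediawiki/
git grep -n 'swift' cookbooks/sre/switchdc/mediawiki/
```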
[21:30:23] <_joe_> I don't think it's bogus, probably something changed in an upstream of it [21:30:31] <_joe_> kamila_: wikifeeds talks to everything [21:30:40] <_joe_> ATS talks to it via the rest gateway [21:30:46] oh, rx, yes [21:31:04] <_joe_> but the feeds that became slow are the on-this-day feed and the one about specific dates [21:31:10] <_joe_> which go to aqs [21:31:21] <_joe_> but aqs seems unimpressed [21:31:25] <_joe_> so, hear me out [21:31:40] <_joe_> I think we can try turning it on and off again [21:31:45] :D [21:31:48] <_joe_> (wikifeeds) [21:31:53] <_joe_> but tomorrow is fine too [21:32:00] ctrl-alt-delete [21:32:14] given the 42 day runtime, seems sensible to me [21:34:07] I don't mind giving it a kick now, but I am curious whether you have specific reasons to think that'd help [21:34:09] <_joe_> https://grafana.wikimedia.org/d/UWuaaNl4k/aqs-2-0?orgId=1&var-dc=thanos&var-service=page-analytics&var-site=eqiad&var-prometheus=k8s&var-container_name=All&viewPanel=34&from=now-90d&to=now aqs is pretty stable I'd say [21:34:40] <_joe_> kamila_: basically it looks like nothing besides itself can be slow in the puzzle, but I'd dig a bit more [21:34:40] that is a boring graph [21:35:12] <_joe_> urandom: golang for you [21:35:16] _joe_: fair enough, thanks [21:35:43] I didn't know wikifeeds talks to aqs [21:37:56] <_joe_> via rest-gateway ofc [21:38:03] <_joe_> but even there, [21:38:05] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-kubernetes_namespace=All&var-destination=rest-gateway&var-destination=restbase-for-services&viewPanel=14&from=now-30d&to=now [21:42:00] I'll kick it now, OK? [21:42:21] kick = kubectl rollout restart [21:42:33] <_joe_> kamila_: actually use helmfile [21:43:20] <_joe_> https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart [21:43:26] oh, ok, thanks, til it has roll restarts without needing a change :D [21:43:41] oh, ours does [21:43:42] right [21:44:05] <_joe_> yeah it's a horrible hack we did :) [21:44:24] I keep trying to forget! :D [21:44:26] <_joe_> but it works well [21:44:46] <_joe_> as long as you don't look how the sausage is made [21:44:52] I did once :D [21:45:09] that's probably why my brain wants to not remember :D [21:51:18] well, that doesn't seem to have helped! [21:51:30] :( [21:51:36] yeah, no change in graphs [21:52:20] I suppose if the world were to end, we could pool codfw and depool, but the world isn't ending [21:52:28] *depool eqiad [21:53:05] just sayin that options exist [21:53:37] good to have options ;) [21:53:56] might just be a better option to raise the threshold for the page though, given that it's been like that forever :D [21:54:27] (and fix it of course, but just for now if it annoys you) [21:57:54] yeah [22:08:28] * jhathaway looks for knob [22:09:37] uh that's weird, either grafana hates me or there are no metrics for wikifeeds envoy telemetry in eqiad [22:09:49] and I have a guess why :D [22:13:42] but my guess is wrong! [22:13:54] and the clock says I really should go [22:14:22] indeed, thanks for your help kamila_! [22:15:26] jhathaway: if you do decide to change the alert instead of just silencing it, you can add me to the CR as a fallback for remembering to change it back [22:15:38] will do [22:15:45] np, I wasn't much help :D [22:15:47] o/ [22:19:22] o/ [22:27:25] (ok, grafana hates me, the metrics exist, they just don't say anything interesting... 
and now I'm really off XD) [22:29:53] XD [22:33:48] !incidents [22:33:48] 4530 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [22:33:49] 4529 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [22:33:49] 4528 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [23:00:32] 🤞
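For completeness, the wikifeeds kick that was tried (and the follow-up check) would look much like the changeprop restart earlier, per the Rolling_restart procedure linked above; the helmfile path and environment here are assumptions:

```bash
# Roll-restart wikifeeds in eqiad without a config change, then confirm the
# pods actually cycled (fresh AGE). Per the graphs discussed above this did not
# move the latency, so the next lever is the alert threshold, not another restart.
cd /srv/deployment-charts/helmfile.d/services/wikifeeds   # path is an assumption
helmfile -e eqiad --state-values-set roll_restart=1 sync
kubectl -n wikifeeds get pods
```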