[08:16:43] The commit-message-validator tool has a bug which makes it always exit 0 [08:17:04] that is T360460 and the fix is to invoke it as `commit-message-validator validate` [08:17:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012994 Fix commit-message-validator being always successful [08:17:05] T360460: commit-message-validator job exits successfully even when it fails - https://phabricator.wikimedia.org/T360460 [08:17:21] so if one could review/merge that Puppet change, that will fix it for operations/puppet :-] [08:20:44] hashar: mefge [08:20:47] merged [08:20:53] \o/ [08:20:55] thanks! [08:51:21] I notice an unusual rise in tempauth errors in swift since 14:45 UTC yesterday; dunno if this relates to the DC switchover? [08:51:53] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?from=1710751904816&to=1710924644816&orgId=1&var-site=All&viewPanel=39 [08:52:30] it's not a _lot_, but normally we have 0, to have >0 over time is unusual [09:11:08] Emperor: 14:30 (start of the errors on moss-fe) coincides with swift-ro switchover [09:11:12] dunno if that helps [09:42:40] Yeah, I think something isn't quite right here :-/ [09:53:27] Emperor: it's currently pooled only in eqiad, do you think we need to repool it in codfw? did we forget something in the process of switching over to only one dc? [09:55:12] One of the drawbacks of tempauth is that it doesn't really log anything. [09:55:34] So I know we have a rise in tempauth failures, because there's a counter for that which we graph. [09:56:01] I'm wondering if there's a (WLOG) client with a duff credential or somesuch [09:57:49] hmm [10:00:16] (and if there's a nicer way of finding that than grobbling through logs looking for 401/403 and seeing if there's an over-represented IP) [10:04:53] Unrelated, but I think there's something wrong with changeprop aswell https://grafana.wikimedia.org/goto/a8wFtAJSk?orgId=1 [10:06:27] containers are getting oomkilled [10:13:45] ok they're in crashloopbackoff, deleteing the pod and letting it re-create seems like it fixes it [10:14:15] I'm going to run a rolling restart, cc akosiaris jayme [10:14:34] well a roll-recreate [10:14:58] claime: ack [10:15:17] you need a hand or something? [10:15:29] no I think I'll be all right [10:15:37] I'll try helmfile --state-values-set roll_restart=1 sync [10:15:42] If that doesn't work I'll kubectl it [10:15:56] can you check if it's doing the same in codfw? [10:16:03] (all containers in CrashLoopBackOff) [10:17:51] claime: they seem fine in codfw [10:17:56] ack [10:21:42] They're starting to crash again [10:22:02] Most of them already have 2+ restarts in less than 5 minutes [10:22:06] nothing in logs [10:24:22] it does not seem like oomk to me, it's exitcode 1 [10:24:39] jayme: the alert that tacked me on to it was oomkill [10:24:42] for changeprop thundering herd kind of stuff shouldn't be the problem, correct? [10:24:55] jayme: now it's something different [10:25:38] claime: should we rollback your changeprop change from yesterday? [10:25:44] just to rule that out? 
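The rolling restart mentioned at 10:14–10:15 follows the usual deployment-charts pattern quoted in the log (`--state-values-set roll_restart=1 sync`). A minimal sketch of that sequence, assuming the standard helmfile layout on the deployment host and the eqiad environment (both assumptions, not shown in the log), and that kubectl is already pointed at the right cluster:

```bash
# Roll-restart the changeprop release without changing its values
# (the roll_restart=1 trick quoted above), then watch pod health.
cd /srv/deployment-charts/helmfile.d/services/changeprop   # path is an assumption
helmfile -e eqiad --state-values-set roll_restart=1 sync

# Watch for pods going back into CrashLoopBackOff and their restart counts.
kubectl -n changeprop get pods -w
```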
[10:25:49] akosiaris: I guess [10:25:52] it also seems off since yesterday - so we should probably do ^ [10:25:57] Let's [10:26:35] {"name":"change-propagation","hostname":"changeprop-production-665f4c5548-2vd5v","pid":1,"level":"FATAL","err":{"message":"","name":"TypeError","stack":"TypeError: Cannot set property name of which has only a getter\n at Function.assig [10:26:50] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1012771 [10:26:54] revert commit [10:27:40] oh no, wait [10:27:43] If it doesn't work, we'll need to rollback the actual helm release [10:27:47] that won't fix it? [10:27:53] I think we want a pin to the older helm chart [10:27:55] Because it bumped a bunch of versions of other stuff [10:27:57] yeah [10:28:05] let me see if I can get the diff [10:28:09] I'd roll back the helm release thb [10:28:14] <_joe_> +1 [10:28:18] so instead a version: 0.13.x in helmfile.yaml [10:28:20] 0.13.12 [10:28:27] <_joe_> or that [10:28:28] rolling back helm [10:28:32] (in the meantime [10:28:34] ) [10:30:51] 11 Thu Dec 21 16:26:16 2023 superseded changeprop-0.13.2 Upgrade complete [10:31:04] helm -n changeprop rollback 11 [10:31:07] right? [10:31:10] yes [10:31:28] wenn we end up with Claus Conkle again after, we can pin the chart version in helmfile to the be 0.13.2 and roll out the config change [10:31:30] yes [10:31:32] *when [10:32:04] well, helm -n changeprop rollback changeprop 11 [10:32:05] ok, rollback done [10:32:12] the second changeprop being the release name [10:32:21] production is actually the release name but yes :p [10:32:30] lol, yes. [10:32:37] ok so [10:32:59] the config change would need to be done manually? [10:33:31] the messages referring to it might have all been discarded by now? [10:33:51] Also, somehow, it got upgraded from 0.13.2 to 0.13.12 without that being recorded by helm [10:34:25] Didn't fix it [10:34:54] give it a bit, this is changeprop, delayed effects is it's thing [10:35:13] akosiaris: changeprop-production-7d8fcbb5d-49lc9 3/3 Running 2 (61s ago) 3m41s [10:35:23] The pods already have 2 restarts in 3 minutes [10:35:25] but ok [10:36:35] <_joe_> it's apparently picking up? [10:36:39] the fatals are still in the logs as well [10:36:46] "message": "Error during deduplication", [10:36:46] "err_str": "ReplyError: ERR Connection timed out", [10:36:55] it doesn't even say what it is timing out against [10:36:55] <_joe_> that would be redis [10:37:53] yeah I end up getting a redis RedisClient error [10:38:12] <_joe_> so... let me understand [10:38:13] credentials or something? [10:38:20] <_joe_> we rollecd back to the right version? [10:38:24] <_joe_> or not? [10:38:42] Through helm, we rolled back to 0.13.2 [10:38:47] <_joe_> ok [10:38:52] Not 0.13.12 which was yesterday's version [10:38:54] <_joe_> now the errors are different, correct? [10:38:57] <_joe_> ah I see [10:39:10] But 0.13.12 isn't in helm history for some $deity-forsaken reason [10:39:13] <_joe_> ok, let's get to 0.13.12 with setting it explicitly in helmfile then? [10:39:20] let's [10:39:29] <_joe_> claime: you're sure it was 0.13.12? [10:39:41] I'll try and find my diff [10:40:03] netpols appear correct [10:40:06] it was .2 [10:40:10] <_joe_> ok [10:40:11] I read poorly [10:40:15] <_joe_> ok cool [10:40:17] phew [10:40:21] I got very scared [10:40:23] <_joe_> the problem now seems to be with redis [10:40:38] <_joe_> what redises is the config saying it's connecting to? 
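The rollback sequence being worked out above, with the release-name correction applied (namespace and revision number are the ones quoted in the log), would look roughly like this — a sketch, not a transcript of the exact commands run:

```bash
# List revisions of the "production" release in the changeprop namespace
# to find the last known-good chart version (changeprop-0.13.2, revision 11 above).
helm -n changeprop history production

# Roll the release back to that revision and confirm its state.
helm -n changeprop rollback production 11
helm -n changeprop status production
```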
[10:40:50] servers: [10:40:50] - rdb1011.eqiad.wmnet:6379:1 "cp-1" [10:40:50] - rdb1013.eqiad.wmnet:6379:1 "cp-2" [10:40:50] timeout: 1000 [10:41:12] that's nutcracker btw ^ [10:41:58] nothing in nutcracker logs btw [10:43:03] oh damn [10:43:07] i think we're missing a networkpolicy for the ipv6 address of 1011 [10:43:07] this is ipv6 related I think [10:43:07] <_joe_> nutcracker logs are useless [10:43:17] jayme: beat me to it [10:43:18] <_joe_> yep... [10:43:32] yeah, that's it, let's fix that [10:43:46] why did this decide to bite us today? [10:43:51] <_joe_> ok so, we need to pin the chart version for now [10:43:55] <_joe_> akosiaris: yesterday, but yes [10:43:56] all right, so we don't revert, we roll forward, with the ipv6 policy on top [10:44:01] _joe_: why? [10:44:06] (serious question) [10:44:18] <_joe_> claime: because this wasn't the problem you had before [10:44:31] I am not sure either we should rule forward [10:44:34] 1 change at a time [10:44:35] <_joe_> this problem arose because I think we added the ipv6 records lately [10:44:36] That's fair [10:44:38] <_joe_> yes [10:44:42] ok [10:44:45] yeah, I'd also say we stick to 0.13.2 [10:44:48] <_joe_> and after changeprop was last restarted [10:44:49] <_joe_> yes [10:44:51] <_joe_> and tbh [10:44:58] <_joe_> let's use the ips in nutcracker [10:45:01] That config file should not be in the chart, ik ik [10:45:05] <_joe_> instead of adding ipv6? [10:45:18] niah, it's easy to DTRT right now [10:45:22] gimme 2 mins [10:45:40] was about to ask if you're creating a patch...ok [10:46:15] <_joe_> I don't think it's necessarily the right thing, but ok :) [10:46:59] which version am I pinning again? 0.13.12 ? [10:47:05] .2 [10:47:08] 0.13.2 [10:48:04] <_joe_> are we sure there are still problems connecting to redis? [10:48:06] <_joe_> https://grafana.wikimedia.org/d/-Ay4Dd6Vz/redis-dashboard-for-prometheus-redis-exporter-1-x?orgId=1&var-namespace=&var-instance=rdb1011:16379&viewPanel=10&from=now-2d&to=now [10:48:14] <_joe_> looks like it's catching up some slack [10:48:18] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013018 [10:48:38] maybe it's falling back to 1013 after some time? [10:48:47] I expect it eventually to realize that 1013 is actually ok [10:49:09] and for nutcracker to create a long lasting IPv4 connection to rdb1011 [10:49:35] +1s please ? [10:49:43] <_joe_> done [10:49:57] same :) [10:50:05] thanks [10:50:27] <_joe_> claime: did you only deploy eqiad yesterday? [10:50:31] <_joe_> or also codfw? [10:50:39] <_joe_> because codfw is working rn [10:50:58] _joe_: noth [10:51:00] both* [10:51:08] 11 Tue Mar 19 13:44:49 2024 deployed changeprop-0.14.1 Upgrade complete [10:51:10] <_joe_> ok so I guess it wasn't the chart number [10:51:18] <_joe_> but just the issue with the redis connections [10:51:35] deploying [10:54:15] <_joe_> I see everything in crashloopbackoff rn [10:54:20] <_joe_> in eqiad [10:54:21] yeah, trying to fix it [10:54:53] Did the manual rollback break helmfile? [10:54:57] ok, a forceful deletion of all pods made it happen faster [10:55:08] claime: no, apparently [10:55:25] the only diff I saw thanks to the pin was the netpol diff [10:55:35] I got the first OOMKilled though right now [10:55:41] changeprop-production-7d8fcbb5d-k6lzf 2/3 OOMKilled 0 75s 10.67.158.228 kubernetes1062.eqiad.wmnet [10:55:42] yep, just saw two of them [10:55:59] <_joe_> so yeah, more memory is needed rn? [10:56:01] <_joe_> a ton more? 
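One way to confirm the suspicion about the missing IPv6 egress rule is to compare the redis/nutcracker backends' AAAA records with what is actually rendered in the namespace's NetworkPolicies. A sketch, assuming kubectl is pointed at the eqiad cluster:

```bash
# AAAA records recently added for the redis backends behind nutcracker.
for h in rdb1011.eqiad.wmnet rdb1013.eqiad.wmnet; do
  printf '%s AAAA: ' "$h"; dig +short AAAA "$h"
done

# Does any applied NetworkPolicy allow egress to rdb1011 over v6?
# No match here, while the v4 address is present, would explain the timeouts.
addr=$(dig +short AAAA rdb1011.eqiad.wmnet | head -1)
[ -n "$addr" ] && kubectl -n changeprop get networkpolicy -o yaml | grep -F -- "$addr"
```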
[10:56:11] yeah [10:56:17] we're not lacking in memory [10:56:22] let's give it some [10:56:29] <_joe_> and well [10:56:36] claime: you 'll do the patch? [10:56:41] <_joe_> the alternative is to move eventgate back to codfw right now [10:56:42] sure [10:56:44] or should I? [10:57:20] On it [10:57:32] It's 1500Mi rn, I say 2Gi even, and we'll go from there? [10:57:33] <_joe_> some stuff is getting processed btw [10:57:38] <_joe_> 3Gi [10:57:41] <_joe_> at least [10:57:45] ack [10:58:12] <_joe_> things *are* getting processed rn [10:59:27] <_joe_> but *very* slowly [10:59:34] <_joe_> we might need more replicas too? I dunno [10:59:40] I see some crashloopbackoffs too [10:59:42] with this [10:59:44] "level": "ERROR", [10:59:44] "message": "Exec error in changeprop", [10:59:59] Should stay in quota for container, but I'll raise it a bit for the namespace just in case [11:00:02] but the status is 404 and no clear page pattern [11:00:03] <_joe_> sigh [11:00:10] s/quota/limitrange/ [11:00:30] I think it's not killing it, just spewing out logs in fact [11:00:31] <_joe_> ok I'll say this: we should move restbase back to codfw, all of it [11:01:00] Reason: CrashLoopBackOff [11:01:00] Last State: Terminated [11:01:00] Reason: OOMKilled [11:01:03] no wait, it's memory [11:01:07] <_joe_> yeah it's memory [11:01:22] let's wait for claime's deploy then [11:02:03] <_joe_> yep [11:03:10] <_joe_> maybe we shouldd just kill the transcludes.resource_change stuff eventually [11:03:52] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013021 [11:03:59] (waiting on ci) [11:04:51] <_joe_> jayme: ^^ can you check? I'm looking at other stuff rn [11:05:05] <_joe_> so yeah if we want to move current events to be processed [11:05:06] claime: I think you can't do less than 100Mi IIRC ? [11:05:18] so that 50Mi needs a bump? [11:05:20] <_joe_> we just need to repool eventgate-main and restbase-async in codfw [11:05:24] akosiaris: those are directly copied from default values [11:05:32] ah, I misremember then [11:05:54] <_joe_> changeprop is now processing 99 objects/s [11:05:57] <_joe_> in eqiad [11:06:05] <_joe_> which isn't great but better than the 2/s before [11:06:28] <_joe_> so I hope with more memory it should be able to acutally get to process backlog [11:06:33] yeah [11:06:39] <_joe_> but I think we need to remove pressure [11:06:53] <_joe_> and repool eventgate in codfw [11:07:00] <_joe_> anyone against it rn? [11:07:09] yeah, me [11:07:20] I am not sure if we actually have an issue [11:07:33] as in... events don't get processed fast enough by changeprop [11:07:34] <_joe_> we have 6 million objects in backlog [11:07:35] so? [11:07:45] Merging memory raise [11:07:55] this is supposed to be updating RESTBase which is deprecated [11:08:05] and some other stuff of course [11:08:12] <_joe_> mostly restbase, yes [11:08:43] and we never got numbers as to how fast restbase **needs** to be updated after a page edit [11:09:08] <_joe_> it's gonna have consequences for edits via VE I think? [11:09:21] I think VE doesn't go via restbase in like all cases rn [11:09:27] <_joe_> unless VE actually uses parsoid now yeah [11:09:36] VE isn't going through restbase anymore iirc [11:09:39] <_joe_> but yes we need the memory [11:09:40] it's like PCS only ? [11:09:49] <_joe_> akosiaris: which is the page summaries, basically [11:09:55] <_joe_> and the mobile applications [11:09:56] which... meh? 
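Before settling on a number for the memory bump, the OOM kills and the namespace limits can be read straight from the API. A sketch — the pod name is the example from the listing a bit earlier in the log, and the 1500Mi/3Gi figures are the ones discussed above:

```bash
# Confirm the container really died to the kernel OOM killer rather than
# crashing on its own: the last state shows "Reason: OOMKilled" as quoted above.
kubectl -n changeprop describe pod changeprop-production-7d8fcbb5d-k6lzf \
  | grep -A4 'Last State'

# See what defaults/min/max the namespace LimitRange and ResourceQuota impose,
# so a 3Gi container limit (and the matching quota bump) lands inside the allowed range.
kubectl -n changeprop describe limitrange
kubectl -n changeprop describe resourcequota
```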
[11:10:10] <_joe_> we should still increase the memory [11:10:17] <_joe_> just so it stops crashing [11:10:25] <_joe_> which I think is a consequence of the huge backlog [11:10:45] ah, dammit - there is a typo clem [11:10:50] *claime [11:11:13] missing limitrange [11:11:15] yeah [11:11:17] claime: yes [11:11:22] on it [11:15:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013024 < passes CI locally [11:16:47] uh but fails in ci [11:16:53] PS1 fails [11:16:58] because I forgot the pods stanza [11:17:06] PS2 should be all right [11:17:08] ah [11:17:13] wait for CI anyways [11:17:42] the more we get bitten my this the more I think we should maybe drop limitranges :| [11:17:53] <_joe_> I was thinking the same [11:17:55] and only rely on quota? [11:18:00] <_joe_> or make this shit easier to change [11:18:17] V+2 thank you butler [11:18:18] we definitely needs the defaults [11:18:40] but the min/maxes per pods and container...hmmm [11:18:53] for some critical workloads, it's pointless [11:19:00] it needs some love, that's for sure [11:20:49] at least we know the KubernetesContainerOomKilled warning is useful :p [11:21:58] <_joe_> let's hope more memory is enough [11:24:45] akosiaris: you didn't deploy to codfw earlier? [11:24:53] <_joe_> no [11:25:04] never touched codfw [11:25:04] <_joe_> let's leave codfw as-is for now [11:25:25] ack [11:25:57] Deploying new memory limit [11:26:31] all pods running [11:27:04] <_joe_> processing has resumed [11:27:54] I've been doing Advanced Data Science (TM) on the swift logs [11:27:59] [yes, it's a shell pipeline] [11:28:33] Particularly focusing on 401 errors (apropos tempauth). [11:28:52] This found one external IP that's been doing a few thousand a day(!) but that was doing so before the switchover. [11:29:16] <_joe_> Emperor: maybe this isn't the right channel, being public? [11:29:17] But also, the eqiad frontends are logging a lot of 401s from hosts in 10.194.x.x [11:29:31] _joe_: I'm not going to mention said IP here, I think it's an incidental finding [11:29:39] <_joe_> oh ok :) [11:29:42] which codfw wasn't before the switchover [11:29:53] big uplift in dequeue rate and produced messages [11:29:59] <_joe_> yes [11:30:06] containers look stable [11:30:07] nice [11:30:07] <_joe_> 4k messages processed/s [11:30:25] <_joe_> in about 20 hours we'll be over the backlog [11:30:34] more replicas? [11:30:41] <_joe_> actually 1 hour [11:30:47] <_joe_> claime: it wouldn't help with this [11:31:02] Ah yeah because it's pinned thingies right? [11:31:11] can't remember how they're actually called [11:31:12] <_joe_> 1 consumer per topic yes [11:31:14] yeah [11:31:18] but ~1h is fine, isn't it? [11:31:22] <_joe_> yes [11:31:27] jayme: yeah, I reacted to the 20h [11:31:30] 1h is perfectly ok [11:31:30] <_joe_> if it keeps, up, let's see in 10 minutes [11:31:39] cool. so I can pack my computer and go see Fabio now :) [11:31:42] given it's been borked for ~20h [11:31:43] I don't think this can be a consequence of T358830 since I think that change hasn't ridden the train yet [11:31:54] <_joe_> I'm looking at sum(irate(changeprop_normal_rule_processing_count[5m])) [11:32:13] <_joe_> Emperor: uhm, what servers are those IPs for? [11:32:21] Am I right that 10.194.x.x is our k8s range? Is it possible there's some credential gone awry/mis-set (remember the two swift clusters are distinct?) 
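Answering the 10.194.x.x question is a one-liner against the Kubernetes API; the lookup that produces the thumbor line pasted a few messages below presumably looked something like this (the IP is the example given there, on the codfw cluster):

```bash
# Map a pod IP seen in the swift proxy logs back to its workload and node.
# -o wide adds the pod IP and node columns seen in the output quoted below.
kubectl get pods --all-namespaces -o wide | grep -F '10.194.134.119'
```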
[11:32:35] <_joe_> Emperor: it's very possible ofc [11:32:36] it's probably thumbor pods [11:32:37] max mem usage stable at around 2.2GB [11:32:42] you are correct btw [11:32:45] grep -F ' 401 ' /var/log/swift/proxy-access.log | cut -d ' ' -f 6 | sort | uniq -c | sort -rn -k 1,1 # what I'm looking at [11:32:56] <_joe_> Emperor: give us one ip? [11:33:04] 10.194.134.119 [11:33:07] <_joe_> akosiaris: can you look if those are thumbor pods? [11:33:18] <_joe_> while I check changeprop for a bit more [11:33:31] _joe_: changeprop looks CPU bound now [11:33:39] <_joe_> so actually 1 hour is a bit optimistic [11:33:40] it's getting throttled [11:33:48] <_joe_> claime: it's ok for now [11:33:58] <_joe_> it's working in overdrive already [11:34:00] yeah, not heavily so it's all right [11:34:03] <_joe_> keep an eye on parsoid [11:34:12] <_joe_> I'll keep an eye on purges at the edge [11:34:33] thumbor thumbor-main-5b8b5855ff-8crtv 11/11 Running 111 (30h ago) 27d 10.194.134.119 mw2350.codfw.wmnet [11:34:36] Emperor: ^ [11:34:38] so that's expected [11:35:13] akosiaris: it's surely not expected for thumbor to be getting lots of 401 from swift? Particularly it wasn't happening in codfw pre-switch [11:35:27] Oh, wait, is thumbor trying to use codfw-credentials to talk to eqiad-swift? [11:35:32] oh I meant for swift to see the IP you pasted [11:35:36] _joe_: expected bump in slow processing for parsoid [11:35:38] <_joe_> oh yes [11:35:58] <_joe_> Emperor: that might be it if thumbor uses the discovery record, which is dumb [11:36:05] Emperor: let me doublecheck that [11:36:28] <_joe_> akosiaris: if thumbor uses swift.discovery.wmnet anywhere, that's the problem [11:36:35] <_joe_> and we need to repool it *now* [11:36:45] SWIFT_HOST = 'https://swift.discovery.wmnet' [11:36:46] yes it is [11:36:51] I am repooling it [11:36:52] aha [11:37:00] Emperor: good find [11:37:07] I'm glad I kept poking that niggling "this doesn't look right" :) [11:37:13] Emperor: very good catch [11:37:34] Should we change thumbor's values-{eqiad,codfw}.yaml to point to the dc-local records? [11:37:59] swift Active/Active pooled [11:37:59] swift-ro Active/Active pooled [11:37:59] swift-rw Active/Passive pooled [11:38:00] <_joe_> claime: yes but for now let's repool swift [11:38:04] all 3 only in eqiad [11:38:14] <_joe_> yeah repool all of them :/ [11:38:14] _joe_: ack [11:39:23] ok, something needs fixing there. Glancing at the config, thumbor doesn't have the notion of "fetch from this read-only swift, put result in this r/w swift" [11:39:44] <_joe_> akosiaris: thumbor should only use the local swift, ever [11:39:46] _joe_: re: changeprop, big expected bump of rps to mw-api-int as well [11:39:47] <_joe_> it's by design [11:40:04] <_joe_> if it's not, that's gonna cause a ton of issues [11:40:05] local swift> +1 [11:40:22] (like rps x5 for mw-api-int) [11:40:22] why did it end up with a discovery record in the config then... hmmm [11:40:23] (remember that the two ms swift clusters are entirely distinct, separate credentials, the works) [11:40:29] <_joe_> akosiaris: mistake [11:40:35] <_joe_> 100% a mistake [11:40:35] because a lot of things now use the core parser and not parsoid anymore [11:40:50] <_joe_> is anyone repooling swift? [11:41:02] I assume effie is on top of it? 
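A variant of Emperor's pipeline above that keeps only in-cluster sources, to check whether the 401s are dominated by a single pod range; this assumes the client IP is field 6 of proxy-access.log, as in the original one-liner:

```bash
# Count 401s per client IP, restricted to the k8s pod range (10.194.x.x),
# most frequent first -- a quick way to spot a single misconfigured workload.
grep -F ' 401 ' /var/log/swift/proxy-access.log \
  | awk '$6 ~ /^10\.194\./ {print $6}' \
  | sort | uniq -c | sort -rn | head -20
```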
[11:41:05] I am [11:41:16] <_joe_> ah sorry I missed your message effie <3 [11:41:27] I was looking if thumbor is pooled in both dcs [11:41:40] <_joe_> effie: thumbor isn't reached via discovery [11:41:41] https://grafana.wikimedia.org/goto/FZJjxAJIz?orgId=1 < Hello events [11:41:46] _joe_: yeah [11:41:52] <_joe_> it's a local attachment to the swift cluster [11:41:54] I had to re-remember [11:42:13] <_joe_> claime: is mw doing allright? [11:42:18] peachy [11:42:20] <_joe_> mw-api-int, I mean [11:42:23] <_joe_> cool [11:42:28] claime: changeprop (the instance, not the jobqueue) needs to die... [11:42:38] akosiaris: very much agreed [11:43:06] <_joe_> so as I feared, the backlog isn't going down rn [11:43:09] _joe_: how is the edge doing? [11:43:11] <_joe_> because we're generating more links [11:43:27] <_joe_> claime: uhm we should be probably re-forwarding to your last chart version heh [11:43:48] Oh it's still there x) [11:43:53] yeah we should [11:43:53] it dropped off drastically for a while https://grafana-rw.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&from=now-1h&to=now&forceLogin&viewPanel=27 [11:43:57] patch incoming [11:44:01] I suppose it's that page again ? [11:44:05] it is [11:44:12] claime: just revert my version: pin and deploy [11:44:28] yeah [11:44:34] page 48 of the pdf done, how many times again ? [11:44:42] <_joe_> too many :D [11:45:28] 2000^2000 [11:45:33] (yeah) [11:45:38] Emperor: errors are going down now I reckon ? [11:46:42] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013029 [11:46:57] what's peculiar is that the swift.discovery.wmnet change was merged on Dec 08, 2022 [11:47:08] so, we 've been through 2 switchovers without noticing [11:47:09] ? [11:47:40] <_joe_> akosiaris: we never depooled swift before IIRC [11:47:51] <_joe_> during the switchover [11:47:53] and if end-users noticed... we never heard much from them [11:47:58] <_joe_> because that record made little sense [11:48:01] <_joe_> akosiaris: actually... [11:48:15] <_joe_> I think some issues with thumbnailing that were surfaced yesterday... [11:48:19] which is a different broken thing [11:48:29] I will mend thumbors helmfile [11:48:57] did we not? I don't remember telling Clement to handle swift as an exception 1y ago. [11:49:01] * akosiaris digging into tasks [11:49:21] <_joe_> akosiaris: the cookbook in the past had an exception for swift [11:49:24] Reverting version pin in eqiad [11:49:28] <_joe_> it was listed as one of the excluded services [11:49:34] <_joe_> claime: ack [11:50:26] effie: still non-zero, I'll give it a bit longer to settle [11:50:31] New release deployed [11:50:43] <_joe_> Emperor: non-zero but reduced? [11:52:01] 'swift', # temporary, undergoing rebalancing (T287539#7339799) [11:52:01] 'swift-ro', # per above [11:52:01] T287539: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 [11:52:16] the exclusion was temporary and was removed before last years March switchover [11:52:56] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?from=now-15m&to=now-1m&orgId=1&var-site=All&viewPanel=39 tempauth now looks good thanks effie [11:52:57] confd firing for swift-rw, see -operations [11:53:11] ok, I got to take a break actually. 
I am hungry and I need to pickup my daughter from school [11:53:59] mw-api-int will fire for saturation, but latency's all right [11:54:15] (hovering around 60%) [11:54:20] <_joe_> Emperor, effie https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013033 [11:54:52] Have another big file to exclude actually [11:54:58] Parsing File:Eaddy - English(...) - NARA - 108642302 (page 1309).jpg was slow, took 9.99 seconds [11:55:05] patch incoming [11:55:06] <_joe_> oh christ [11:55:45] ? [11:55:52] Emperor: changeprop stuff [11:55:57] ah, OK [11:57:40] <_joe_> Emperor: yeah sorry I wasn't invoking you [11:57:45] x) [11:57:51] <_joe_> but rather a generic deity who should help us right now [11:58:02] just copped another actually [11:58:08] srsly [11:58:17] <_joe_> claime: stop playing whack-a-mole [11:58:21] <_joe_> I have a proposal [11:58:38] <_joe_> a regexp for anything with (page \d\d\d+) in the title [11:58:49] Hmmm [12:00:28] will changeprop's regex engine take that? [12:00:41] <_joe_> no idea! [12:00:46] Let's find out! [12:01:01] <_joe_> try it in staging first [12:02:59] Emperor: we will leave swift in both DCs pooled for today and I am updating thumbor right now. I will attempt to depool codfw again tomorrow morning along with one more service that was naughty during the switchover [12:08:11] _joe_: Actually that'll still trigger snowballs, just smaller ones [12:08:19] (the \d\d\d+ idea) [12:15:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013035 [12:16:34] It'll be better though [12:21:18] claime: happy to +1 it, since we are experimenting anyway [12:23:38] effie: ack, thanks for the update [12:24:20] <_joe_> claime: I would wait 1 hour or so [12:25:55] Yeah there hasn't been any more these matching files for a bit [12:27:55] summary definitions rerenders isn't going down though :( [12:32:54] I think mobileapps is bottlenecking on memoty [12:33:08] https://grafana.wikimedia.org/goto/fW3uL01Iz?orgId=1 [12:33:49] <_joe_> we've halved the backlog btw [12:34:36] the total backlog? [12:34:52] because some topics aren't going down at all [12:35:06] I see a bunch of them that did resolve though so that's good [12:35:27] <_joe_> yes [12:35:32] <_joe_> the total backlog [12:35:55] <_joe_> it's down from 7.92M to 4.98M [12:36:01] <_joe_> over 1 hour [12:36:32] <_joe_> but now the decrease has slowed, because we're generating a ton of jobs [12:37:54] <_joe_> so I guess I wasn't that wrong saying 20 hours [12:39:19] swift related question [12:39:27] should swift-rw be pooled in both dcs? [12:39:32] I thought it was a/p [12:39:48] afaik it used to be, not anymore, but ask someone else [12:40:03] maybe I am confusing it with cassandra/aqs [12:40:11] <_joe_> claime: swift-rw points to what? [12:40:15] <_joe_> the main swift cluster? [12:40:58] same as the other swift dnsdics [12:41:09] <_joe_> yes it does [12:41:12] 10.2.x.27 [12:41:13] <_joe_> so it's meaningless [12:41:27] <_joe_> I don't know why we introudced it, probably historical reasons [12:41:40] Yeah there's a TODO remove from dns on it in service.yaml [12:41:59] but since it's pooled like that and is active_active: false in service.yaml, it's alerting [12:42:17] or rather because of its dns record I guess [12:42:18] good catch then as an actionable to cleanup [12:42:21] <_joe_> ok, set it to only eqiad I guess? [12:42:24] yeah [13:34:50] Do we have a phab task for (this) switchover and/or associated issues? 
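On the "will changeprop's regex engine take that?" question above: change-propagation is a Node.js service, so the proposed pattern can be sanity-checked against the offending titles with a one-liner before it goes anywhere near staging. A sketch — the first title is abridged exactly as it appears in the log, the second is a made-up negative case, and how the pattern ultimately gets escaped in the chart's YAML is a separate question:

```bash
# Quick check that the proposed exclusion pattern matches the problem titles.
node -e '
  const re = /\(page \d\d\d+\)/;
  const titles = [
    "File:Eaddy - English(...) - NARA - 108642302 (page 1309).jpg",  // from the log
    "File:Some ordinary upload.jpg",                                  // hypothetical non-match
  ];
  for (const t of titles) console.log(re.test(t), t);
'
```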
[13:35:51] Emperor: https://phabricator.wikimedia.org/T357547 [13:42:24] TY :) [13:45:03] fyi, switchover in 15m, please don't run cookbooks for the next 30-45m to avoid interfering with the switchover. [13:45:26] good luck team! [13:46:05] akosiaris: I suggest to cross post this in the two private channels and dcops one [13:47:32] good point, thanks [13:48:36] * volans learned it from past experience ;) [13:50:16] !oncall-now [13:50:16] Oncall now for team SRE, rotation business_hours: [13:50:17] j.ayme, a.kosiaris, h.erron, j.hathaway [13:50:44] Dear SREs, we have started the switchover preliminary work [13:51:14] good luck [13:52:34] break a leg [13:52:39] is it expected/known that cumin1002 has a root tmux with a dry-run switchdc cookbook still running? [13:53:07] akosiaris: where is the switchover coordination going to happen? here or in -operations? [13:53:50] taavi: could be teh test one, we can ignore [13:54:07] marostegui: I would suggest we post anything here if we have to, unless anyone objects [13:54:17] ok [13:55:27] ok. Overall, if you see anyone posting things into other channels and those things are related to the switchover, ask them to repeat here. [14:00:55] mwmaint2002:~$ systemctl list-units 'mediawiki_job_*' says failed [14:06:04] ● mediawiki_job_growthexperiments-listTaskCounts.service loaded failed failed MediaWiki periodic job growthexperiments-listTaskCounts [14:06:07] etc etc [14:06:19] reset-failed, stop [14:06:22] imo [14:06:58] the stop thing has happened fine [14:07:00] doesn't seem like a big deal, as long as it is not running [14:07:03] <_joe_> I think the problem is a shell command launched by bvibber [14:07:14] <_joe_> which is running a script [14:08:00] looks unrelated ? [14:08:15] we 'll have to kill it though I think [14:08:26] afaict those units are all `Main process exited, code=killed, status=15/TERM`, `Failed with result 'signal'.` [14:08:46] which is fine [14:08:47] so they're just sad because we killed them, seems fine to clear and ignore [14:08:48] yeah [14:08:52] I think we do need to kill the script yeah [14:08:53] <_joe_> akosiaris: it needs to be killed [14:08:59] ok, let me reset-failed all these then [14:09:00] I am killking it [14:09:06] cool, thanks [14:09:35] 0 loaded units listed. Pass --all to see loaded but inactive units, too. [14:09:47] so, reset-failed worked fine apparently we are good to proceed [14:10:09] (and next switchover maybe we won't be using systemd units so that won't be an issue 🤞) [14:10:46] 😱 [14:10:53] lgtm to continue [14:10:55] Need to rerun the stopmaintenance imo [14:11:13] yep [14:11:48] <_joe_> it won't work even if we re-run it, let me ping brooke [14:12:16] killing pid 4551 (bash rerun13.sh) should get rid of it I think [14:12:19] _joe_: why wouldn't it? the originating bash script has been killed, what's left is mwscript? 
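The cleanup discussed above boils down to two things on the maintenance host: clearing the failed (deliberately TERM'd) job units, and finding the stray manually-launched wrapper before killing it. Roughly:

```bash
# Units killed by stop-maintenance end up "failed" with result 'signal';
# list them, then clear the failed state so the check goes green again.
systemctl list-units 'mediawiki_job_*' --state=failed
systemctl reset-failed 'mediawiki_job_*'

# Find the manually launched wrapper that stop-maintenance did not own
# (rerun13.sh / pid 4551 in the log) so it can be killed explicitly.
pgrep -af rerun13.sh
```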
[14:12:34] <_joe_> oh it was now I think [14:12:39] which should get killed by stop-maintenance [14:12:51] oh I thought e.ffie was killing it already 👍 [14:13:00] <_joe_> taavi: yeah I wanted to let brooke know before it was killed [14:13:23] I killed it [14:14:22] yep, seems gone to me [14:14:32] final call for GO/NOGO [14:14:33] PASS now [14:14:40] go [14:14:42] rust [14:14:44] wait, wrong joke [14:14:57] lol tx [14:15:09] go [14:15:16] effie: I am ready [14:15:50] * claime listens to the silence [14:15:51] quiet from hatnote [14:16:29] * brouberol plays "Sounds of Silence" [14:16:39] Hello darkness my old friend… [14:17:54] looking good [14:18:14] eqiad writtable [14:18:29] hatnote played an edit again [14:18:29] ping hatnote [14:18:29] <_joe_> sounds [14:18:31] yay [14:18:34] <_joe_> yeah [14:18:36] I can write in eswiki [14:18:40] I just did a test edit on es too [14:18:53] labswiki works too [14:19:21] testwiki also editable with visualeditor [14:19:41] back to the usual edit rate more or less [14:19:41] And commons too [14:19:58] <_joe_> effie: please proceed [14:20:10] <_joe_> we need to get past the jobrunners stage before we're out of impact [14:20:16] POST 5xxs peaked in eqiad, dropping again [14:21:05] * Emperor has no further swift-related spanners to lob in the works [14:21:05] rzl: where? [14:21:07] bare metal? [14:21:09] <_joe_> rzl: post to what? [14:21:19] <_joe_> please be a bit more specific :) [14:21:25] still digging, that was just on https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=eqiad&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=1710943577654&to=1710944477654 [14:21:39] _joe_: I have a networki hiccup [14:21:48] but alex pushed the button for the next stage [14:22:25] elastic@eqiad might scream a bit, seeing a small latency spike (hopefully short) [14:22:39] All DB masters looking fine [14:22:50] <_joe_> we should see the jobrunners pick up the traffic in eqiad [14:22:57] <_joe_> and go down in codfw [14:23:19] <_joe_> confirmed it's happening [14:23:26] definitely picking up in eqiad [14:24:34] <_joe_> yeah error rate recovered [14:25:57] <_joe_> mw-api-int in eqiad is running very hot [14:26:02] <_joe_> but latency is still ok [14:26:19] That's the jobs restarting I bet [14:26:19] <_joe_> utilization is at 65-70% though [14:26:37] <_joe_> claime: it's the result of changeprop's outage from the last day mostly [14:26:42] shhhh [14:26:45] :p [14:26:48] <_joe_> the moving of all rw traffic just got over the edge a bit [14:27:04] it should run fine at 70/75% [14:27:14] as long as latency doesn't go bad we're fine [14:27:18] <_joe_> yeah... [14:28:17] I can throw a few more replicas at it if need be [14:28:29] Is the deployment server going to be switched today too? [14:29:16] <_joe_> claime: not right now, I don't think we need to [14:29:22] _joe_: No I don't think either [14:29:24] those exceptions were all `Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: You can't edit now. This is because of maintenance. 
Copy and save your text and try again in a few minutes.` from mediawiki in eqiad, both metal and k8s, in the couple of minutes after we went RW https://logstash.wikimedia.org/goto/d4a690118115b6abbb551ab36672f1c5 [14:29:27] Just stating we have the option [14:29:44] filtered out kube-mw-jobrunner in that link because it dominates the numbers but they were affected too [14:29:48] <_joe_> rzl: which is probably stuff that was running before the change propagated [14:29:51] yeah [14:30:03] <_joe_> rzl: I imagine mw-jobrunner mostly in codfw though? [14:30:15] yep [14:32:15] 2m 41s of read-only per cookbook output. [14:33:13] to _joe_'s point maybe we should start tracking the maintenance downtime too [14:33:21] marostegui: tomorrow is our aim for deployment server [14:33:27] maintenance and/or jobrunners downtime [14:33:27] good thanks [14:33:58] <_joe_> rzl: as of now the jobrunners downtime is not user-facing, that's why we never stressed it [14:34:06] <_joe_> now, maybe this will change in the future [14:35:09] mm [14:35:35] I mostly just mean if we've decided the current range of RO time is about right, we can move on to shooting for a high score on something else :D [14:37:35] <_joe_> rzl: I am starting to think we should automate all steps in the ro-phase into one, with the ability to abort [14:37:53] <_joe_> the steps are pretty set in stone by now and relatively straightforward [14:38:16] yeah, I remember we talked about that a couple years ago but now maybe it's a good time [14:38:36] I think we decided there wasn't much upside except for saving a few seconds, but maybe there isn't much downside either [14:39:01] and it means if the operator loses internet during the critical period, the switchover succeeds by default instead of failing by default [14:40:02] yep, we talked about it after my run iirc [14:40:25] <_joe_> ok, I need to caffeinate before my meetings [14:41:43] <_joe_> good job everyone [14:41:56] <_joe_> claime, jayme any idea what to do with changeprop? [14:42:09] yeah gg all :) [14:42:35] _joe_: backlog picked up a bit, going down but I expect it will stabilise at the same size as before the switchover [14:42:39] _joe_: not really...but it smells a bit like restbase is failing for quite some stuff https://logstash.wikimedia.org/goto/6bbaf27628889a292e343499c22c83ea [14:43:14] <_joe_> we'll see how it goes, I would say we let it be for a few hours then reconvene? [14:43:20] not sure if that's solely related to all the old stuff in the queues [14:43:32] I would try raising concurrency for the jobs with the highest backlogs if it doesn't go down below "before switchover levels" [14:44:36] purges have started going down though [14:44:57] <_joe_> jayme: please tell urandom about the cassandra stuff... [14:45:26] apart from the purges it's all summary rerenders, mobile sections [14:45:28] I think you just did :D [14:45:30] which I think is all PCS [14:45:39] :) [14:45:45] * urandom reads backscroll [14:45:48] So we could in theory bump PCS resources a bit and that could maybe help? [14:48:26] at least it seems the backlog is no longer increasing [14:48:55] I don't know grafana is unhappy with me at the moment [14:49:47] it's been unhappy with me all day long today [14:50:08] well..not really. mw-purge is descreasing now but summary_definition is still increasing [14:50:38] mobile_sections slowly decreasing now as well [14:51:08] Thing is I'm not seeing obvious signs of saturation on PCS [14:51:17] So maaaaaybe increase concurrency on these queues? 
idk [14:51:34] that'll increase pressure on mw-api-int tho [14:51:37] jayme: those (changeprop) errors are pretty opaque [14:52:02] well, no shit :D [14:52:59] I thought maybe that's what restbase returned - or does changeprop talk to cassandra directly? [14:53:34] <_joe_> no it talks to restbase [14:53:40] <_joe_> those are errors returned by restbase [14:53:48] so I thought [14:56:14] I restarted my maint scripts in mwmaint1002 (with Alex and Effie's approval) [14:57:20] I made some numbers, the official :-) read only time was 3 minutes, and 8 seconds; but that's a worse case scenario, in wikidata, the actual maximum time between 2 edits was 2 minutes, and 39 seconds [14:57:21] Lucas_WMDE: see above: [15:33:21] marostegui: tomorrow is our aim for deployment server [14:58:12] ack, sorry, didn’t read all the backscroll yet [14:59:19] Switchover done, thank y'all for being so supportive! [14:59:36] great job effie and the team [14:59:46] gg effie \o/ [14:59:58] nice job team! [15:00:12] nicely done, and once again thanks for making this super easy to follow along :) [15:00:12] and effie for leading it [15:00:36] I think the changeprop errors correspond to these in restbase(?) https://logstash.wikimedia.org/goto/b9463398cbebd94003b0f978349e5b09 [15:01:20] ResponseError: Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required) <-- that is very weird [15:01:33] effie: well done, I think it's the first time I see a switchover without elastic or wdqs alerting in some way or another :) [15:01:43] +1 [15:02:15] :D [15:02:16] since replica count (per datacenter) is 3, LOCAL_QUORUM is 2...so 'over 3 required' ? [15:02:35] congrats effie! [15:02:55] ( ^_^)o自自o(^_^ ) CHEERS! [15:04:18] urandom: how many servers are running in that clusters? If 5, then LOCAL_QUORUM would be 3, and if only 2 servers are UP/in UN state, then the read query does not get the required consistency level and fails [15:04:26] at least that's how I remember it [15:05:33] brouberol: there are way more than 5 nodes, but the quorum is for the number of replicas [15:07:04] right, and I take it we have RF=3 for this table? [15:07:14] yes [15:07:53] that _is_ weird then. LOCAL_QUORUM should be 2 [15:16:26] congrats 👏 [15:35:13] Ok, there is a decommission running, which temporarily has the effect of increasing the replica count (to preserve atomicity), so that explains how the (otherwise bizarre) error is *possible*... but even if the temporary quorum is 3, I don't know why that would be failing. [15:35:34] everything seems up and healthy [15:43:46] wait, no, that can't be write... the adjusted replica set should only be for writes, no way that would make sense for reads, which means it is in fact bizarre [15:43:57] haha, can't be "write"... whew [15:59:49] sukhe: let me check the logs what we were running at that time [16:02:18] 2024-03-20 14:16:56,264 jiji 2113574 [INFO _log.py:114 in log_task_start] START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [16:04:09] effie: I do see the error, just trying to figure why it is failing [16:04:39] it's just _var_lib_gdnsd_discovery-swift-rw.state.toml [16:05:35] sukhe: it may (or may not) be related with us repooling swift* on codfw [16:05:41] this morning [16:06:44] effie: repooling on codfw? [16:06:48] sukhe@cumin2002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 get /conftool/v1/discovery/swift-rw/codfw [16:06:51] {"pooled": false, "references": [], "ttl": 300} [16:07:00] this is not expected then? 
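The etcd read above generalises to both sides, which is a quick way to see what the discovery record is meant to look like before blaming gdnsd; the eqiad key mirrors the codfw one quoted and is assumed here:

```bash
# Read the conftool discovery state for swift-rw in both DCs
# (same etcd endpoint and key layout as the command quoted above).
for dc in eqiad codfw; do
  printf '%s: ' "$dc"
  etcdctl -C https://conf1007.eqiad.wmnet:4001 get "/conftool/v1/discovery/swift-rw/$dc"
done
```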
[16:07:42] swift-rw is typically a/p, AIUI [16:08:03] - dnsdisc: swift-rw # TODO: remove this from DNS! [16:08:03] active_active: false [16:08:03] - dnsdisc: swift-ro # TODO: remove this from DNS! [16:08:03] active_active: true [16:08:20] though I think neither swift-rw nor swift-ro are actually used for anything [16:08:24] Emperor: so I did confuse it with aqs indeed [16:08:38] I depooled swift-rw from codfw earlier today btw [16:08:41] through conftool [16:08:50] claime: yeah that adds up [16:08:59] my notes say they were intended for some new development that never happened [16:09:26] sigh, so we have two things here causing us confusion [16:09:41] welcome to swift ;-) [16:09:44] haha [16:10:28] When it's only two things causing confusion with swift, it's a good day [16:10:37] 😿 [16:13:20] effie: my suspicion is that all is good on your end, it's just confd misbehaving, if I at least look at /var/run/confd-template for example [16:14:37] sukhe@dns1004:/var/lib/gdnsd$ cat discovery-swift-rw.state [16:14:37] 10.2.1.27 => DOWN/300 [16:14:37] 10.2.2.27 => UP/300 [16:16:39] ^ this looks fine so there's that [16:16:46] ok, I am just going to clean up the state and see [16:20:15] Ok, so the Cassandra decommission has been stopped, and the changeprop errors have subsided. The quorum issue now makes sense in the context of blocking read repairs (write triggered by a read), and this obscure bug: https://issues.apache.org/jira/browse/CASSANDRA-19120. [16:21:15] I guess this was happening before, but we just didn't notice because the decommissions were happening in eqiad before the switch over [16:23:21] effie: all good, cleaning up the state helped. no actionable for you, sorry for the noise :) [16:24:08] the question of why this happens still remains so I will file a task [16:31:00] <_joe_> sukhe: it's a known limitation [16:31:12] <_joe_> I think there is a note about cleaning up those files [16:31:17] ah ok, let me check [16:31:21] <_joe_> in the switchover procedure [16:31:33] <_joe_> somewhere, or it used to be there [16:31:40] for now, I cleaned them up [16:31:52] <_joe_> thanks <3 [16:32:24] the cookbook does delete them [16:32:31] or at least used to [16:33:10] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/09-restore-ttl.py#25 [16:33:55] that matches it, yes [16:34:11] but swift is not in the list of MEDIAWIKI_SERVICES [16:34:16] weird though why it didn't work [16:34:17] ah [16:34:30] in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/__init__.py#17 [16:36:07] I am guessing we should swift-rw there in the restore-ttl cookbook [17:38:23] <_joe_> no [17:38:40] <_joe_> it was just a previous change done by hand to fix an outage [17:38:46] <_joe_> sorry, I wans't reading [17:38:50] <_joe_> and about to go offline [17:39:10] np and ok! [21:27:58] <_joe_> jhathaway, urandom, kamila_ let's move here [21:28:03] <_joe_> so one thing that matches is [21:28:05] <_joe_> https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&var-dc=thanos&var-site=eqiad&var-service=wikifeeds&var-prometheus=k8s&var-container_name=All&refresh=30s&from=now-30d&to=now&viewPanel=27 [21:28:17] <_joe_> network rx bytes increasead around that time [21:28:53] hmm [21:30:00] transmit looks pretty flat, so perhaps bogus inbound traffic? [21:30:17] what talks to wikifeeds? 
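The "why didn't the cookbook clean this up" thread above comes down to which dnsdisc names are in MEDIAWIKI_SERVICES; that is quick to confirm from a checkout of operations/cookbooks (the paths are the ones linked in the log):

```bash
# In a checkout of operations/cookbooks: see which discovery records the
# switchdc cookbooks touch, and whether swift-rw is among them.
git grep -n 'MEDIAWIKI_SERVICES' cookbooks/sre/switchdc/mediawiki/
git grep -n 'swift' cookbooks/sre/switchdc/mediawiki/
```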
[21:30:23] <_joe_> I don't think it's bogus, probably something changed in an upstream of it [21:30:31] <_joe_> kamila_: wikifeeds talks to everything [21:30:40] <_joe_> ATS talks to it via the rest gateway [21:30:46] oh, rx, yes [21:31:04] <_joe_> but the feeds that became slow are the on-this-day feed and the one about specific dates [21:31:10] <_joe_> which go to aqs [21:31:21] <_joe_> but aqs seems unimpressed [21:31:25] <_joe_> so, hear me out [21:31:40] <_joe_> I think we can try turning it on and off again [21:31:45] :D [21:31:48] <_joe_> (wikifeeds) [21:31:53] <_joe_> but tomorrow is fine too [21:32:00] ctrl-alt-delete [21:32:14] given the 42 day runtime, seems sensible to me [21:34:07] I don't mind giving it a kick now, but I am curious whether you have specific reasons to think that'd help [21:34:09] <_joe_> https://grafana.wikimedia.org/d/UWuaaNl4k/aqs-2-0?orgId=1&var-dc=thanos&var-service=page-analytics&var-site=eqiad&var-prometheus=k8s&var-container_name=All&viewPanel=34&from=now-90d&to=now aqs is pretty stable I'd say [21:34:40] <_joe_> kamila_: basically it looks like nothing besides itself can be slow in the puzzle, but I'd dig a bit more [21:34:40] that is a boring graph [21:35:12] <_joe_> urandom: golang for you [21:35:16] _joe_: fair enough, thanks [21:35:43] I didn't know wikifeeds talks to aqs [21:37:56] <_joe_> via rest-gateway ofc [21:38:03] <_joe_> but even there, [21:38:05] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-kubernetes_namespace=All&var-destination=rest-gateway&var-destination=restbase-for-services&viewPanel=14&from=now-30d&to=now [21:42:00] I'll kick it now, OK? [21:42:21] kick = kubectl rollout restart [21:42:33] <_joe_> kamila_: actually use helmfile [21:43:20] <_joe_> https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart [21:43:26] oh, ok, thanks, til it has roll restarts without needing a change :D [21:43:41] oh, ours does [21:43:42] right [21:44:05] <_joe_> yeah it's a horrible hack we did :) [21:44:24] I keep trying to forget! :D [21:44:26] <_joe_> but it works well [21:44:46] <_joe_> as long as you don't look how the sausage is made [21:44:52] I did once :D [21:45:09] that's probably why my brain wants to not remember :D [21:51:18] well, that doesn't seem to have helped! [21:51:30] :( [21:51:36] yeah, no change in graphs [21:52:20] I suppose if the world were to end, we could pool codfw and depool, but the world isn't ending [21:52:28] *depool eqiad [21:53:05] just sayin that options exist [21:53:37] good to have options ;) [21:53:56] might just be a better option to raise the threshold for the page though, given that it's been like that forever :D [21:54:27] (and fix it of course, but just for now if it annoys you) [21:57:54] yeah [22:08:28] * jhathaway looks for knob [22:09:37] uh that's weird, either grafana hates me or there are no metrics for wikifeeds envoy telemetry in eqiad [22:09:49] and I have a guess why :D [22:13:42] but my guess is wrong! [22:13:54] and the clock says I really should go [22:14:22] indeed, thanks for your help kamila_! [22:15:26] jhathaway: if you do decide to change the alert instead of just silencing it, you can add me to the CR as a fallback for remembering to change it back [22:15:38] will do [22:15:45] np, I wasn't much help :D [22:15:47] o/ [22:19:22] o/ [22:27:25] (ok, grafana hates me, the metrics exist, they just don't say anything interesting... 
and now I'm really off XD) [22:29:53] XD [22:33:48] !incidents [22:33:48] 4530 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [22:33:49] 4529 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [22:33:49] 4528 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway.discovery.wmnet esams) [23:00:32] 🤞
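For completeness, the wikifeeds kick that was tried (and the follow-up check) would look much like the changeprop restart earlier, per the Rolling_restart procedure linked above; the helmfile path and environment here are assumptions:

```bash
# Roll-restart wikifeeds in eqiad without a config change, then confirm the
# pods actually cycled (fresh AGE). Per the graphs discussed above this did not
# move the latency, so the next lever is the alert threshold, not another restart.
cd /srv/deployment-charts/helmfile.d/services/wikifeeds   # path is an assumption
helmfile -e eqiad --state-values-set roll_restart=1 sync
kubectl -n wikifeeds get pods
```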