[06:31:29] hi folks, it seems at least 4 restbase servers have not received any deployments since at least October
[06:31:31] https://phabricator.wikimedia.org/T333069#8737298
[06:33:52] I've marked it as UBN for now, since it presumably means that something else has gone terribly wrong
[06:37:50] <_joe_> legoktm: if I had to bet, they're not in the scap configuration for restbase
[06:38:03] <_joe_> good morning :)
[06:38:33] <_joe_> let me ping nemo-yiannis and duesen on the task :)
[06:39:18] morning :)
[06:39:26] and seems like it: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/restbase/deploy/+/refs/heads/master/scap/targets - if that's the correct file
[06:40:03] <_joe_> yes it is
[06:40:21] <_joe_> I can write a patch for it, but I won't be deploying restbase now
[06:43:19] <_joe_> sigh this is much worse than expected
[06:43:52] AFAICT they are running the same version they were installed with
[06:43:55] <_joe_> i mean this was the worst footgun of all of scap3
[06:44:01] <_joe_> legoktm: correct
[06:44:07] <_joe_> that includes restbase1016
[06:44:43] <_joe_> which at least was reinstalled in march
[06:46:00] I can't find tickets about putting these servers into service, just the DC-ops tickets
[06:46:15] (T294372, T301399)
[06:46:15] T301399: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399
[06:46:15] T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372
[06:46:51] <_joe_> yeah so, patch incoming
[06:48:58] <_joe_> legoktm: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/904006 if you want to +1 it
[06:49:18] done
[06:49:49] <_joe_> ok, given how outdated it all is, I would be inclined to deploy myself, but I'm not sure what our status is atm
[06:50:36] <_joe_> so I might instead make those servers pick up the code
[06:54:05] :fingers_crossed:, I'm going to dip out now, I was only supposed to be asleep 2 hours ago -.-
[06:54:11] this was fun, let's do it again in a few years ;-)
[06:54:55] <_joe_> ahaha
[06:54:59] <_joe_> good night!
[07:03:11] clickbait blogpost: "How I discovered a UBN issue on Wikimedia servers using Rust"
[07:06:55] <_joe_> ahahah
[09:02:43] _joe_: there is a patch that also needs to be deployed, i can take care of the deployment
[09:03:30] <_joe_> nemo-yiannis: i did fix things already, the code there was 1.5 yrs old
[09:03:48] ah, ok i thought there was a pending deployment to apply the changes
[09:04:09] thanks _joe_
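The root cause above is that several restbase hosts were simply never listed in the scap targets file of mediawiki/services/restbase/deploy, so scap silently skipped them on every deploy. As a rough illustration only (not the tooling actually used here), here is a minimal Python sketch of how one might cross-check such a targets file, assuming it is a dsh-style list with one FQDN per line and `#` comments; the expected host set below is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: find hosts that should be scap deploy targets but are not listed.

Assumes scap/targets is a dsh-style file (one FQDN per line, '#' starts a
comment). EXPECTED_HOSTS is illustrative only; in practice it would come
from Puppet/Netbox or a cumin query, not a hard-coded set.
"""
from pathlib import Path

EXPECTED_HOSTS = {
    # hypothetical examples, not the real fleet list
    "restbase1031.eqiad.wmnet",
    "restbase1032.eqiad.wmnet",
    "restbase1033.eqiad.wmnet",
    "restbase2027.codfw.wmnet",
}


def read_targets(path: str) -> set[str]:
    """Parse a dsh-style targets file into a set of hostnames."""
    hosts = set()
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hosts.add(line)
    return hosts


if __name__ == "__main__":
    targets = read_targets("scap/targets")
    missing = sorted(EXPECTED_HOSTS - targets)
    if missing:
        print("hosts missing from scap/targets (they will never get deploys):")
        for host in missing:
            print(f"  {host}")
    else:
        print("all expected hosts are listed as scap targets")
```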
[09:28:06] hello folks, I moved kafka-jumbo1001 to PKI TLS certs for the kafka broker. All the clients should support the new TLS cert, but in case you notice something wrong lemme know
[09:55:24] elukey: TIL `$ phaste --help`
[09:59:10] hmmm alerts.wm.o is screaming regarding not being able to reach prometheus.svc.esams.wmnet
[09:59:21] XioNoX: my TIL was a little bit more brutal :D
[10:00:02] vgutierrez: WIP AFAIK, see -operations
[10:00:02] vgutierrez: yeah denisse is upgrading that host to bullseye
[10:00:11] oh ok, my fault
[10:00:36] Yes, I'm working with that host.
[10:01:02] do we still have monitoring in esams during the upgrade?
[10:01:52] not prometheus no
[10:02:36] volans: If you require monitoring in esams I can continue with the update later.
[10:03:13] I don't personally require it, I think we need it all the time :)
[10:06:16] those in the PoPs are also VMs, so shouldn't it be easy to have a new one with the new OS and fail over to it once it's all set and ready, without losing observability on the whole DC?
[10:06:54] volans: Good point, I'll change the approach. Thanks.
[10:11:02] true in theory, I'm not sure about going through a VM cycle (commission a new VM, decom the old one) and possibly waiting for enough data to accumulate, or doing a data transfer
[10:11:09] vs an in-place upgrade
[11:50:11] TIL idempotency-key https://datatracker.ietf.org/doc/html/draft-ietf-httpapi-idempotency-key-header-02
[12:25:13] there is some weird trend of 502s on upload going on: https://grafana.wikimedia.org/goto/GNYApQB4z?orgId=1
[12:26:11] ^ not sure at which layer, but CCing vgutierrez Emperor for now just as a heads up
[12:28:20] looking at: https://grafana.wikimedia.org/goto/PkqYpwfVz?orgId=1 it looks like a backend issue (swift-codfw)
[12:29:44] https://grafana.wikimedia.org/goto/1EgwtwBVk?orgId=1
[12:31:25] very uneven load on proxies
[12:32:22] https://grafana.wikimedia.org/goto/mO_jpQB4z?orgId=1
[12:33:28] I'm not seeing an increase in 5xx from the proxy servers
[12:33:52] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1 cf the Proxy errors per server panel
[12:34:56] I see 15 errors/s for swift responses since 11:10
[12:35:48] not sure if you mean you don't see what I mean on the graphs I shared, or that you don't see those in the logs?
[12:38:42] I mean that, per the Proxy errors per server panel, I don't see a rise in errors from the proxy servers. Which makes me suspect it might be thumbor errors being passed through swift to ATS
[12:39:31] see why I think the proxies are in a bad state: https://phab.wmfusercontent.org/file/data/x3wu7u3c5txdec4araef/PHID-FILE-y26wi6miyv4vebpg645b/Screenshot_20230329_143811.png
[12:40:08] 2011 seems way more loaded than the others
[12:40:48] since around 11h
[12:42:20] loading of the swift proxies is often uneven (I'm not really sure why), but a system load of 7ish on ms-fe2011 is within expected range, I think
[12:44:00] I'll run a rolling-restart of the codfw proxies just in case, though.
[12:45:40] to me the spike of 504s followed by a high-ish base of 502s smells like some breakage (doesn't have to be swift - could be thumbor or something else within the upload realm)
[12:51:27] thumbor stats are even weirder - there is a big decrease of 429s at 11h, but that is in eqiad, which should have nothing to do with this
[12:53:16] https://grafana.wikimedia.org/goto/0HeWTwB4z?orgId=1
[12:54:03] sadly, there is no codfw graph: https://grafana.wikimedia.org/goto/V4FnTwB4k?orgId=1
[12:54:39] oh, it is, below
[12:55:03] I was about to say
[12:55:12] :)
[12:55:13] Emperor: how is the rolling restart progressing? things seem to be changing
[12:56:02] latency going down
[12:56:18] 'tis done
[12:56:59] 5XX dropped
[12:57:47] if it was that, then 10 points to anyone who can explain why, if the swift proxies were producing errors, there are almost no errors in https://grafana.wikimedia.org/goto/H2KcowB4z?orgId=1
[12:58:39] assuming you really think that is real - that looks like a ticket to me
[12:58:57] because I despair of randomly turning things off and on again and also despair of the swift dashboard being entirely full of lies no matter how often I try and make it better
[12:59:26] what's swift_proxy_server_errors_total recording?
[13:01:18] - match: swift.*.*.proxy-server.errors
[13:01:35] in hieradata/role/common/swift/proxy.yaml
[13:03:05] yeah, I meant what's that metric reporting, maybe it's not counting the same things
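When the ATS-side graphs and the swift dashboard disagree like this, it can help to take Grafana out of the loop and query the underlying counter directly from the Prometheus HTTP API. Below is a minimal sketch, assuming Python with the `requests` package; the Prometheus URL is a placeholder (point it at whichever instance scrapes the codfw swift proxies), and the time window matches the roughly 11:10 UTC onset discussed above. The same query_range call could be repeated for the ATS-side 5xx metric (whose exact name is not given in this log) to check whether the two series are really counting the same requests.

```python
#!/usr/bin/env python3
"""Sketch: pull swift_proxy_server_errors_total straight from the Prometheus
HTTP API for the incident window, to see whether the counter moved at all.

PROM_URL is a placeholder; use the Prometheus instance that scrapes the
codfw swift proxies. Requires the 'requests' package.
"""
import datetime as dt

import requests

PROM_URL = "http://prometheus.example.internal/api/v1/query_range"  # placeholder
QUERY = "sum(rate(swift_proxy_server_errors_total[5m]))"

start = dt.datetime(2023, 3, 29, 11, 0, tzinfo=dt.timezone.utc)
end = dt.datetime(2023, 3, 29, 13, 0, tzinfo=dt.timezone.utc)

resp = requests.get(
    PROM_URL,
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60s",
    },
    timeout=10,
)
resp.raise_for_status()

# query_range returns a matrix: one entry per series, each with [ts, value] pairs
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        when = dt.datetime.fromtimestamp(float(ts), dt.timezone.utc)
        print(f"{when:%H:%M} {float(value):.2f} errors/s")
```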
[13:09:16] honestly, I am very happy and would like to thank Emperor for fixing a relatively small amount of errors so quickly
[13:09:57] that must be wrong IMHO
[13:10:06] I mean https://grafana.wikimedia.org/goto/H2KcowB4z?orgId=1
[13:10:21] vgutierrez: yeah, thanking me for anything is obviously wrong ;p
[13:10:26] :)
[13:10:35] ATS reports 10-15 errors per second from swift, that doesn't correlate at all with that dashboard
[13:10:45] volans: IHNI the metrics for swift all predate me
[13:10:46] 10-15 rps (per site using codfw)
[13:11:04] vgutierrez: I don't know much about upload arch, but couldn't it be thumbor ?
[13:11:14] Emperor: we should graph the 5xx being triggered earlier on your side, at the TLS termination layer
[13:11:31] jynus: those still go via swift
[13:11:44] I see
[13:11:47] ATS doesn't hit thumbor on its own
[13:12:04] [elide my usual rant about that]
[13:12:05] ATS for upload just has two backends: kartotherian for maps and swift for upload
[13:12:08] yeah, so let's create a task, and we can help, Emperor
[13:12:14] IHNI: International Hostage Negotiation Institute according to ddg (I know it's I Have No Idea but still funny)
[13:13:06] * Emperor not sure why claime is asking Dandong Langtou Airport about acronyms ;p
[13:13:16] x)
[13:13:30] one interesting thing is - 503s were correctly detected
[13:14:02] it was 504 and 502 that didn't correlate
[13:14:42] see: https://grafana.wikimedia.org/goto/6WdKAwfVz?orgId=1
[13:30:30] jynus: probably those 502s aren't coming from swift itself?
[13:31:04] jynus: swift has a TLS terminator in front of it, and that could well be returning those bad gateway errors if the backend service (the swift frontends) isn't able to handle the requests
[13:31:16] but I'm speculating here
[13:31:46] yeah, we need to look at the logs and see the context
[13:37:40] nginx doesn't seem to log many errors at all when I've looked
[16:44:43] brett: there's an unmerged puppet patch, can I merge it? (BCornwall: varnish: Change systemd units Requires to BindsTo (d7ed4f66e1))
[16:45:04] Sure, was just about to!
[16:45:10] dcaro: Thanks
[16:45:18] 👍
[16:46:09] (thanks for the reviews j.bond!)
[18:40:53] does anyone know more about the real difference between https://query-preview.wikidata.org/ and regular https://query.wikidata.org/ ? Like.. how much does it matter if the "preview" one was down, who uses it?
[18:44:55] * inflatador should know, but doesn't ;P
[18:45:47] mutante I'll check with my team and get back to you
[18:56:51] inflatador: :) thank you!
[18:56:58] mutante there's some context here https://phabricator.wikimedia.org/T266470
[18:57:07] I am just trying to decide if it deserves its own monitoring.
[18:57:10] the "preview" one
[18:57:17] the regular one gets it for sure
[18:57:30] ack, thanks
[18:58:10] I don't think so, it runs against a test instance
[18:58:23] of wdqs that is
[19:00:08] inflatador: alright, yea, it seems to point to just a single host, as opposed to a discovery name. I will just mark this as "skipped"
[19:00:49] Cool, sounds good
[19:01:41] I am detecting famous words in the ticket "can be dealt with at a later point" :)
[19:02:09] it's some law that this translates to never, but it's ok :)
[20:57:25] mutante: yes, IME the law applies without fail :P
[21:01:22] :)
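On the query-preview question, one quick but very rough way to double-check the "single host, as opposed to a discovery name" observation is to compare what the two names resolve to. The sketch below is purely illustrative: it only looks at DNS resolution, and both names may well resolve to the same edge caches, in which case the distinction has to be confirmed in the internal service configuration rather than here.

```python
#!/usr/bin/env python3
"""Rough sketch: compare what query.wikidata.org and query-preview.wikidata.org
resolve to. This only checks DNS; if both names point at the edge caches, the
'single backend host vs discovery record' distinction must be confirmed in the
internal service configuration instead.
"""
import socket

for name in ("query.wikidata.org", "query-preview.wikidata.org"):
    try:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr); the
        # first element of sockaddr is the resolved IP address
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, 443)})
        print(f"{name}: {', '.join(addrs)}")
    except socket.gaierror as exc:
        print(f"{name}: resolution failed ({exc})")
```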