[06:31:29] hi folks, it seems at least 4 restbase servers have not received any deployments since at least October
[06:31:31] https://phabricator.wikimedia.org/T333069#8737298
[06:33:52] I've marked it as UBN for now, since it presumably means that something else has gone terribly wrong
[06:37:50] <_joe_> legoktm: if I had to bet, they're not in the scap configuration for restbase
[06:38:03] <_joe_> good morning :)
[06:38:33] <_joe_> let me ping nemo-yiannis and duesen on the task :)
[06:39:18] morning :)
[06:39:26] and seems like it: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/restbase/deploy/+/refs/heads/master/scap/targets - if that's the correct file
[06:40:03] <_joe_> yes it is
[06:40:21] <_joe_> I can write a patch for it, but I won't be deploying restbase now
[06:43:19] <_joe_> sigh this is much worse than expected
[06:43:52] AFAICT they are running the same version they were installed with
[06:43:55] <_joe_> i mean this was the worst footgun of all of scap3
[06:44:01] <_joe_> legoktm: correct
[06:44:07] <_joe_> that includes restbase1016
[06:44:43] <_joe_> which at least was reinstalled in march
[06:46:00] I can't find tickets about putting these servers into service, just the DC-ops tickets
[06:46:15] (T294372, T301399)
[06:46:15] T301399: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399
[06:46:15] T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372
[06:46:51] <_joe_> yeah so, patch incoming
[06:48:58] <_joe_> legoktm: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/904006 if you want to +1 it
[06:49:18] done
[06:49:49] <_joe_> ok, given how outdated it all is, I would be inclined to deploy myself, but I'm not sure what our status is atm
[06:50:36] <_joe_> so I might instead make those servers pick up the code
[06:54:05] :fingers_crossed:, I'm going to dip out now, I was only supposed to be asleep 2 hours ago -.-
[06:54:11] this was fun, let's do it again in a few years ;-)
[06:54:55] <_joe_> ahaha
[06:54:59] <_joe_> good night!
[07:03:11] clickbait blogpost: "How I discovered a UBN issue on Wikimedia servers using Rust"
[07:06:55] <_joe_> ahahah
[09:02:43] _joe_: there is a patch that also needs to be deployed, i can take care of the deployment
[09:03:30] <_joe_> nemo-yiannis: i did fix things already, the code there was 1.5 yrs old
[09:03:48] ah, ok i thought there was a pending deployment to apply the changes
[09:04:09] thanks _joe_
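The root cause above is that several restbase hosts were simply never listed in the scap targets file of mediawiki/services/restbase/deploy, so scap silently skipped them on every deploy. As a rough illustration only (not the tooling actually used here), here is a minimal Python sketch of how one might cross-check such a targets file, assuming it is a dsh-style list with one FQDN per line and `#` comments; the expected host set below is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: find hosts that should be scap deploy targets but are not listed.

Assumes scap/targets is a dsh-style file (one FQDN per line, '#' starts a
comment). EXPECTED_HOSTS is illustrative only; in practice it would come
from Puppet/Netbox or a cumin query, not a hard-coded set.
"""
from pathlib import Path

EXPECTED_HOSTS = {
    # hypothetical examples, not the real fleet list
    "restbase1031.eqiad.wmnet",
    "restbase1032.eqiad.wmnet",
    "restbase1033.eqiad.wmnet",
    "restbase2027.codfw.wmnet",
}


def read_targets(path: str) -> set[str]:
    """Parse a dsh-style targets file into a set of hostnames."""
    hosts = set()
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hosts.add(line)
    return hosts


if __name__ == "__main__":
    targets = read_targets("scap/targets")
    missing = sorted(EXPECTED_HOSTS - targets)
    if missing:
        print("hosts missing from scap/targets (they will never get deploys):")
        for host in missing:
            print(f"  {host}")
    else:
        print("all expected hosts are listed as scap targets")
```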
[09:28:06] hello folks, I moved kafka-jumbo1001 to PKI TLS certs for the kafka broker. All the clients should support the new TLS cert, but in case you notice something wrong lemme know
[09:55:24] elukey: TIL `$ phaste --help`
[09:59:10] hmmm alerts.wm.o is screaming regarding not being able to reach prometheus.svc.esams.wmnet
[09:59:21] XioNoX: my TIL was a little bit more brutal :D
[10:00:02] vgutierrez: WIP AFAIK, see -operations
[10:00:02] vgutierrez: yeah denisse is upgrading that host to bullseye
[10:00:11] oh ok, my fault
[10:00:36] Yes, I'm working with that host.
[10:01:02] do we still have monitoring in esams during the upgrade?
[10:01:52] not prometheus no
[10:02:36] volans: If you require monitoring in esams I can continue with the update later.
[10:03:13] I don't personally require it, I think we need it all the time :)
[10:06:16] those in the PoPs are also VMs, so shouldn't it be easy to have a new one with the new OS and fail over to it once it's all set and ready, without losing observability on the whole DC?
[10:06:54] volans: Good point, I'll change the approach. Thanks.
[10:11:02] true in theory, I'm not sure about going through a VM cycle (commission a new VM, decom the old one) and possibly waiting for enough data to accumulate, or doing a data transfer
[10:11:09] vs an in-place upgrade
[11:50:11] TIL idempotency-key https://datatracker.ietf.org/doc/html/draft-ietf-httpapi-idempotency-key-header-02
[12:25:13] there is some weird trend of 502s on upload going on: https://grafana.wikimedia.org/goto/GNYApQB4z?orgId=1
[12:26:11] ^ not sure at which layer, but CCing vgutierrez Emperor for now just as a heads up
[12:28:20] looking at: https://grafana.wikimedia.org/goto/PkqYpwfVz?orgId=1 it looks like a backend issue (swift-codfw)
[12:29:44] https://grafana.wikimedia.org/goto/1EgwtwBVk?orgId=1
[12:31:25] very uneven load on proxies
[12:32:22] https://grafana.wikimedia.org/goto/mO_jpQB4z?orgId=1
[12:33:28] I'm not seeing an increase in 5xx from the proxy servers
[12:33:52] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1 cf the Proxy errors per server panel
[12:34:56] I see 15 errors/s for swift responses since 11:10
[12:35:48] not sure if you mean you don't see what I mean on the graphs I shared, or that you don't see those in the logs?
[12:38:42] I mean that, per the Proxy errors per server panel, I don't see a rise in errors from the proxy servers. Which makes me suspect it might be thumbor errors being passed through swift to ATS
[12:39:31] see why I think the proxies are in a bad state: https://phab.wmfusercontent.org/file/data/x3wu7u3c5txdec4araef/PHID-FILE-y26wi6miyv4vebpg645b/Screenshot_20230329_143811.png
[12:40:08] 2011 seems way more loaded than the others
[12:40:48] since around 11h
[12:42:20] loading of the swift proxies is often uneven (I'm not really sure why), but a system load of 7ish on ms-fe2011 is within expected range, I think
[12:44:00] I'll run a rolling-restart of the codfw proxies just in case, though.
[12:45:40] to me the spike of 504s followed by a high-ish base of 502s smells like some breakage (doesn't have to be swift - could be thumbor or something else within the upload realm)
[12:51:27] thumbor stats are even weirder - there is a big decrease of 429s at 11h, but that is in eqiad, which should have nothing to do with this
[12:53:16] https://grafana.wikimedia.org/goto/0HeWTwB4z?orgId=1
[12:54:03] sadly, there is no codfw graph: https://grafana.wikimedia.org/goto/V4FnTwB4k?orgId=1
[12:54:39] oh, it is, below
[12:55:03] I was about to say
[12:55:12] :)
[12:55:13] Emperor: how is the rolling restart progressing? things seem to be changing
[12:56:02] latency going down
[12:56:18] 'tis done
[12:56:59] 5XX dropped
[12:57:47] if it was that, then 10 points to anyone who can explain why, if the swift proxies were producing errors, there are almost no errors in https://grafana.wikimedia.org/goto/H2KcowB4z?orgId=1
[12:58:39] assuming you really think that is real - that looks like a ticket to me
[12:58:57] because I despair of randomly turning things off and on again and also despair of the swift dashboard being entirely full of lies no matter how often I try and make it better
[12:59:26] what's swift_proxy_server_errors_total recording?
[13:01:18] - match: swift.*.*.proxy-server.errors
[13:01:35] in hieradata/role/common/swift/proxy.yaml
[13:03:05] yeah, I meant what's that metric reporting, maybe it's not counting the same things
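When the ATS-side graphs and the swift dashboard disagree like this, it can help to take Grafana out of the loop and query the underlying counter directly from the Prometheus HTTP API. Below is a minimal sketch, assuming Python with the `requests` package; the Prometheus URL is a placeholder (point it at whichever instance scrapes the codfw swift proxies), and the time window matches the roughly 11:10 UTC onset discussed above. The same query_range call could be repeated for the ATS-side 5xx metric (whose exact name is not given in this log) to check whether the two series are really counting the same requests.

```python
#!/usr/bin/env python3
"""Sketch: pull swift_proxy_server_errors_total straight from the Prometheus
HTTP API for the incident window, to see whether the counter moved at all.

PROM_URL is a placeholder; use the Prometheus instance that scrapes the
codfw swift proxies. Requires the 'requests' package.
"""
import datetime as dt

import requests

PROM_URL = "http://prometheus.example.internal/api/v1/query_range"  # placeholder
QUERY = "sum(rate(swift_proxy_server_errors_total[5m]))"

start = dt.datetime(2023, 3, 29, 11, 0, tzinfo=dt.timezone.utc)
end = dt.datetime(2023, 3, 29, 13, 0, tzinfo=dt.timezone.utc)

resp = requests.get(
    PROM_URL,
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60s",
    },
    timeout=10,
)
resp.raise_for_status()

# query_range returns a matrix: one entry per series, each with [ts, value] pairs
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        when = dt.datetime.fromtimestamp(float(ts), dt.timezone.utc)
        print(f"{when:%H:%M} {float(value):.2f} errors/s")
```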
[13:09:16] honestly, I am very happy and would like to thank Emperor for fixing a relatively small amount of errors so quickly
[13:09:57] that must be wrong IMHO
[13:10:06] I mean https://grafana.wikimedia.org/goto/H2KcowB4z?orgId=1
[13:10:21] vgutierrez: yeah, thanking me for anything is obviously wrong ;p
[13:10:26] :)
[13:10:35] ATS reports 10-15 errors per second from swift, that doesn't correlate at all with that dashboard
[13:10:45] volans: IHNI the metrics for swift all predate me
[13:10:46] 10-15 rps (per site using codfw)
[13:11:04] vgutierrez: I don't know much about upload arch, but couldn't it be thumbor ?
[13:11:14] Emperor: we should graph the 5xx being triggered earlier on your side, at the TLS termination layer
[13:11:31] jynus: those still go via swift
[13:11:44] I see
[13:11:47] ATS doesn't hit thumbor on its own
[13:12:04] [elide my usual rant about that]
[13:12:05] ATS for upload just has two backends: kartotherian for maps and swift for upload
[13:12:08] yeah, so let's create a task, and we can help, Emperor
[13:12:14] IHNI: International Hostage Negotiation Institute according to ddg (I know it's I Have No Idea but still funny)
[13:13:06] * Emperor not sure why claime is asking Dandong Langtou Airport about acronyms ;p
[13:13:16] x)
[13:13:30] one interesting thing is - 503s were correctly detected
[13:14:02] it was 504 and 502 that didn't correlate
[13:14:42] see: https://grafana.wikimedia.org/goto/6WdKAwfVz?orgId=1
[13:30:30] jynus: probably those 502s aren't coming from swift itself?
[13:31:04] jynus: swift has a TLS terminator in front of it, and that could well be returning those bad gateway errors if the backend service (the swift frontends) isn't able to handle the requests
[13:31:16] but I'm speculating here
[13:31:46] yeah, we need to look at the logs and see the context
[13:37:40] nginx doesn't seem to log many errors at all when I've looked
[16:44:43] brett: there's an unmerged puppet patch, can I merge it? (BCornwall: varnish: Change systemd units Requires to BindsTo (d7ed4f66e1))
[16:45:04] Sure, was just about to!
[16:45:10] dcaro: Thanks
[16:45:18] 👍
[16:46:09] (thanks for the reviews j.bond!)
[18:40:53] does anyone know more about the real difference between https://query-preview.wikidata.org/ and regular https://query.wikidata.org/ ? Like.. how much does it matter if the "preview" one was down, who uses it?
[18:44:55] * inflatador should know, but doesn't ;P
[18:45:47] mutante I'll check with my team and get back to you
[18:56:51] inflatador: :) thank you!
[18:56:58] mutante there's some context here https://phabricator.wikimedia.org/T266470
[18:57:07] I am just trying to decide if it deserves its own monitoring.
[18:57:10] the "preview" one
[18:57:17] the regular one gets it for sure
[18:57:30] ack, thanks
[18:58:10] I don't think so, it runs against a test instance
[18:58:23] of wdqs that is
[19:00:08] inflatador: alright, yea, it seems to point to just a single host, as opposed to a discovery name. I will just mark this as "skipped"
[19:00:49] Cool, sounds good
[19:01:41] I am detecting famous words in the ticket "can be dealt with at a later point" :)
[19:02:09] it's some law that this translates to never, but it's ok :)
[20:57:25] mutante: yes, IME the law applies without fail :P
[21:01:22] :)
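On the query-preview question, one quick but very rough way to double-check the "single host, as opposed to a discovery name" observation is to compare what the two names resolve to. The sketch below is purely illustrative: it only looks at DNS resolution, and both names may well resolve to the same edge caches, in which case the distinction has to be confirmed in the internal service configuration rather than here.

```python
#!/usr/bin/env python3
"""Rough sketch: compare what query.wikidata.org and query-preview.wikidata.org
resolve to. This only checks DNS; if both names point at the edge caches, the
'single backend host vs discovery record' distinction must be confirmed in the
internal service configuration instead.
"""
import socket

for name in ("query.wikidata.org", "query-preview.wikidata.org"):
    try:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr); the
        # first element of sockaddr is the resolved IP address
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, 443)})
        print(f"{name}: {', '.join(addrs)}")
    except socket.gaierror as exc:
        print(f"{name}: resolution failed ({exc})")
```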