[09:05:38] hey folks I am about to merge a patch which will add new firewall rules to most hosts
[09:05:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007437
[09:06:19] adds a postrouting chain, anyway it should be fine, been tested N ways, and I will keep an eye on things
[09:06:27] but please ping me if anything looks suspicious
[09:06:51] !log merge host firewall changes to set default DSCP marking (T339850)
[09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:54] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850
[09:07:23] nice!!
[09:47:30] Hi! we have had a few puppet patches break all cloud VM puppet runs due to missing variables in clouds.yaml. There's a test that checks for exactly that, but it was disabled a long time ago for being very noisy; re-enabling it might prevent cloud-wide puppet failures. Does anyone have any reasons against trying to re-enable it?
[11:24:28] btw. this is the patch re-enabling the test if anyone wants to comment there https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051332
[13:11:54] ooh, nice to see the QoS stuff
[13:26:51] inflatador: even nicer it didn't break everything :)
[13:27:28] I'll present on it at an upcoming SRE meeting, covering the options we now have
[13:28:00] Putting the Q in QoS ;P
[14:53:32] brouberol: thanks for your help with benthos, happily resolved :-) https://phabricator.wikimedia.org/T367076#9949907
[14:54:17] nice! So we're batching more between commits?
[14:54:23] 1s instead of 100ms?
[14:55:20] that did a little bit, but not a whole lot
[14:55:31] but the buffer change you +1'd tripled the throughput
[14:55:33] which is plenty :D
[14:56:47] .praise brouberol
[14:58:14] and also praise f.abfur who added a buffer to some completely different benthos instance and thus inspired me to look into whether that could help :D
[15:00:06] we need a praisebot for this channel ;P
[15:00:22] indeed :D
[15:04:08] :)
[15:14:16] I see, so there's an in-memory batch happening before the messages actually hit the kafka client
[15:14:19] is that correct?
[15:19:05] <_joe_> kudos!!
[15:19:23] <_joe_> kamila_: now you should also integrate the data into grafana :)
[15:19:58] brouberol: correct
[15:20:16] !oncall-now
[15:20:16] Oncall now for team SRE, rotation business_hours:
[15:20:16] a.rnoldokoth, c.white, f.abfur
[15:21:33] brouberol: Well, before they get processed internally in benthos in whatever way, in this case it's actually turning them into prometheus metrics, but details :-D
[15:22:09] _joe_: where would you like them? :-D
[15:23:18] well done! This is interesting! We can see the impact on the output rate in kafka https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad+prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=--list&var-topic=mediawiki.httpd.accesslog
[15:24:09] it's also a testament to Benthos being quite slow (or maybe our processing steps being heavy?) as 15MB/s seems pretty low to me. But that shouldn't take anything away from what you've done and its impact, really. Kudos!
[15:25:19] (especially given the host size)
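[Editor's aside, for readers outside the thread: below is a minimal Go sketch of the pattern kamila_ and brouberol are describing, where an in-memory buffer lets the output side pay its fixed per-commit cost once per batch instead of once per message. This is a toy model, not Benthos code; the buffer size, batch size, and simulated commit cost are made-up numbers.]

```go
// Toy sketch (not Benthos internals): an in-memory buffer between a fast
// producer and a consumer that pays a fixed cost per commit. Batching N
// messages per commit amortizes that cost, which is roughly why adding a
// buffer/batch stage in front of the output raises throughput.
package main

import (
	"fmt"
	"time"
)

func main() {
	const total = 1000
	const batchSize = 100 // made-up number

	msgs := make(chan int, 4096) // the in-memory buffer between input and output

	// Input side: produces messages without waiting on the output's commits.
	go func() {
		for i := 0; i < total; i++ {
			msgs <- i
		}
		close(msgs)
	}()

	// Output side: accumulates a batch, then pays the per-commit cost once
	// per batch instead of once per message.
	start := time.Now()
	batch := make([]int, 0, batchSize)
	commits := 0
	flush := func() {
		if len(batch) == 0 {
			return
		}
		time.Sleep(time.Millisecond) // stand-in for per-commit/ack overhead
		commits++
		batch = batch[:0]
	}
	for m := range msgs {
		batch = append(batch, m)
		if len(batch) == batchSize {
			flush()
		}
	}
	flush() // commit any leftover partial batch
	fmt.Printf("%d messages in %d commits, took %v\n", total, commits, time.Since(start))
}
```

[With batchSize set to 1 the run above pays the commit cost a thousand times; at 100 it pays it ten times — the same amortization as committing every 1s instead of every 100ms.]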
[15:25:35] brouberol: yeah, tbh I feel like I don't fully understand the problem, this works but I don't really know why
[15:25:59] I mean, I don't know why the throughput was so low in the first place
[15:26:37] which takes me back to the lack of available instrumentation being an issue
[15:26:47] True
[15:27:27] It's parsing json, which I expect can be fairly slow, but it's not really doing anything other than that, and that's a bit too slow maybe? Hard to say without metrics, as you say
[15:28:54] I really don't think it's a CPU problem though, it's some kind of benthos internals problem, possibly also something to do with the network, and I don't know more than that
[15:28:57] My hunch is, it's spending quite some time GC-ing, as it's processing loads of small messages
[15:29:07] I could be completely wrong
[15:29:10] That could be a thing too, true
[15:29:17] GC metrics exist...
[15:29:39] I'll look into them next time I'm bored :-D
[15:29:47] this is usually when someone on reddit or HN will ask for a rust rewrite
[17:25:37] <_joe_> brouberol: usually go performance is killed by memory allocations, not even GC
[17:26:14] <_joe_> so performant go code tries to be zero-alloc or one-alloc, but ofc it's harder to do with a general-purpose tool like benthos...
[17:27:15] <_joe_> ... but at the same time we also process 200k msg/second with benthos in the webrequests pipeline
[17:27:32] <_joe_> so I guess some of the stuff we do in this pipeline specifically is the bottleneck
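[Editor's aside: a quick way to check _joe_'s point on any Go hot path is to count allocations directly. The sketch below is a generic illustration, not a measurement of Benthos: the JSON payload and struct are invented, and testing.AllocsPerRun is used because it works outside a test binary.]

```go
// Sketch: count per-call heap allocations on a JSON-parsing hot path.
// Decoding into a map allocates for the map plus every key/value; decoding
// into a reusable struct allocates far less. This per-message allocation
// cost is what tends to dominate before GC even enters the picture.
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// Invented payload, loosely shaped like an access-log message.
var payload = []byte(`{"host":"cp1114","status":200,"uri":"/wiki/Main_Page"}`)

func main() {
	// Allocations per call when decoding into a generic map.
	intoMap := testing.AllocsPerRun(1000, func() {
		var m map[string]any
		_ = json.Unmarshal(payload, &m)
	})

	// Allocations per call when decoding into a reusable typed struct
	// (string fields still allocate, but the map overhead disappears).
	type accessLog struct {
		Host   string `json:"host"`
		Status int    `json:"status"`
		URI    string `json:"uri"`
	}
	var rec accessLog
	intoStruct := testing.AllocsPerRun(1000, func() {
		_ = json.Unmarshal(payload, &rec)
	})

	fmt.Printf("allocs/op into map: %.0f, into struct: %.0f\n", intoMap, intoStruct)
}
```

[On the GC side, kamila_'s "GC metrics exist" is right in general: Go's runtime/metrics package and runtime.ReadMemStats expose GC pause and heap-allocation counters that an exporter can surface.]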
[17:31:53] _joe_: o/ you mentioned I should ask you about apache confs. https://phabricator.wikimedia.org/T353817#9933188
[17:33:51] <_joe_> ottomata: I guess you're ok with me taking a look by end of the week, given you're off for the rest of it?
[17:35:00] _joe_: for sure! thank you.
[17:35:19] > 200k msg/second with benthos in the webrequests pipeline
[17:35:19] do we? I think there are performance issues there too.
[17:35:19] https://phabricator.wikimedia.org/T360454
[17:35:19] https://phabricator.wikimedia.org/T364379
[17:36:45] <_joe_> ottomata: I meant the pipeline from webrequest to turnilo/superset live
[17:37:01] <_joe_> I am well aware of those issues :)
[17:38:06] oh right! nice.
[17:44:57] Is https://debmonitor.wikimedia.org/ returning status 500 responses a known issue?
[17:45:13] wfm
[17:45:34] also for me
[17:46:36] The response I get has "x-cache: cp1114 miss, cp1114 pass" and a 500 status :/
[17:46:40] I don't see 500s in the logs either (apart from idp1003, but it's a known issue)
[17:47:14] * volans fixing idp1003 given I'm at it ( elukey FYI )
[17:49:47] {done}
[17:51:16] bd808: are you still able to repro?
[17:52:09] volans: yes. I'm still getting the 500 response via cp1114 miss, cp1114 pass
[17:53:02] mmmh the only thing I got in the debmonitor backend logs for the last 5 minutes is 5 requests with: invalid request block size: 9174 (max 8192)...skip
[17:54:02] but not even sure if related to your request, or your request didn't make it to debmonitor at all
[17:54:58] I first get redirected to idp to provide auth, and then once I have a MOD_AUTH_CAS cookie I get the 500 error
[17:55:06] ahhh
[17:55:46] found them in the apache logs
[17:55:50] https://logstash.wikimedia.org/goto/843ba91df76538fda556a4c35a157bb1
[17:55:51] but nothing on the backend
[17:56:09] volans: yeah I was just about to say, Varnish reports a 500 with Apache backend
[17:57:03] apache doesn't like to redirect bd808 to the backend... maybe trying to logout and re-login into CAS might help? I'm not sure what's wrong here but it seems something related to authn/z
[17:58:00] and yes all the errors are for your user, so it seems something specific to your auth attempts
[17:58:22] bd808: I suggest nuking your cas cookie, at least for debmonitor
[17:58:29] I think I've had this happen to me before, a few years ago heh
[17:58:41] ¯\_(ツ)_/¯ I have recreated it in Firefox with 3 different sessions and Chrome once.
[17:59:43] it's no big deal for me personally today, but vexing I guess to not know why
[18:00:08] huh
[18:00:35] wanna attach gdb to an apache worker on debmonitor?
[18:02:41] soo, checking on disk I see 4 different cookies with Bryan's user
[18:02:48] I could also try to nuke them
[18:02:54] (copy them over for later debug)
[18:05:44] bd808: could you retry once more please?
[18:06:37] volans: same result here. maybe you see something new in the logs :)
[18:06:56] I just saw the new ticket created with as true instead of false
[18:07:50] we might need to ask Simon to have a look tomorrow
[18:08:45] I've moved the old tickets for bryan to /home/root/2024-07-03_bd808_cas_tickets for debugging purposes (to be deleted once done)
[18:09:39] * volans doing one last check into debmonitor's db
[18:14:37] no red flags so far, I'll open a task for Simon
[18:20:37] I've created T369205 and subscribed you, bd808. Sorry for the inconvenience. If you need some specific data right now let me know and I can get it for you ;)
[18:20:39] T369205: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205
[18:22:38] volans: thanks for the effort. I was honestly just going to poke around and see what the tool does these days :)
[18:22:53] pretty much what it did a few years back :D
[18:23:29] list packages installed and upgradable and kernels for all hosts and k8s images
[18:24:32] * volans dinner time