[09:05:38] hey folks I am about to merge a patch which will add new firewall rules to most hosts
[09:05:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007437
[09:06:19] adds a postrouting chain, anyway it should be fine, been tested N ways, and I will keep an eye on things
[09:06:27] but please ping me if anything looks suspicious
[09:06:51] !log merge host firewall changes to set default DSCP marking (T339850)
[09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:54] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850
[09:07:23] nice!!
[09:47:30] Hi! we have had a few puppet patches break all cloud VM puppet runs due to missing variables in clouds.yaml. There's a test that checks for exactly that, but it was disabled a long time ago for being very noisy; re-enabling it might prevent cloud-wide puppet failures. Does anyone have any reasons against trying to re-enable it?
[11:24:28] btw. this is the patch re-enabling the test if anyone wants to comment there https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051332
[13:11:54] ooh, nice to see the QoS stuff
[13:26:51] inflatador: even nicer it didn't break everything :)
[13:27:28] I'll present on it at an upcoming SRE meeting, covering the options we now have
[13:28:00] Putting the Q in QoS ;P
[14:53:32] brouberol: thanks for your help with benthos, happily resolved :-) https://phabricator.wikimedia.org/T367076#9949907
[14:54:17] nice! So we're batching more between commits?
[14:54:23] 1s instead of 100ms?
[14:55:20] that did a little bit, but not a whole lot
[14:55:31] but the buffer change you +1'd tripled the throughput
[14:55:33] which is plenty :D
[14:56:47] .praise brouberol
[14:58:14] and also praise f.abfur who added a buffer to some completely different benthos instance and thus inspired me to look into whether that could help :D
[15:00:06] we need a praisebot for this channel ;P
[15:00:22] indeed :D
[15:04:08] :)
[15:14:16] I see, so there's an in-memory batch happening before the messages actually hit the kafka client
[15:14:19] is that correct?
[15:19:05] <_joe_> kudos!!
[15:19:23] <_joe_> kamila_: now you should also integrate the data into grafana :)
[15:19:58] brouberol: correct
[15:20:16] !oncall-now
[15:20:16] Oncall now for team SRE, rotation business_hours:
[15:20:16] a.rnoldokoth, c.white, f.abfur
[15:21:33] brouberol: Well, before they get processed internally in benthos in whatever way, in this case it's actually turning them into prometheus metrics, but details :-D
[15:22:09] _joe_: where would you like them? :-D
[15:23:18] well done! This is interesting! We can see the impact on the output rate in kafka https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad+prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=--list&var-topic=mediawiki.httpd.accesslog
[15:24:09] it's also a testament to Benthos being quite slow (or maybe our processing steps being heavy?) as 15MB/s seems pretty low to me. But that shouldn't take anything away from what you've done and its impact, really. Kudos!
[15:25:19] (especially given the host size)
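[Editor's aside, for readers outside the thread: below is a minimal Go sketch of the pattern kamila_ and brouberol are describing, where an in-memory buffer lets the output side pay its fixed per-commit cost once per batch instead of once per message. This is a toy model, not Benthos code; the buffer size, batch size, and simulated commit cost are made-up numbers.]

```go
// Toy sketch (not Benthos internals): an in-memory buffer between a fast
// producer and a consumer that pays a fixed cost per commit. Batching N
// messages per commit amortizes that cost, which is roughly why adding a
// buffer/batch stage in front of the output raises throughput.
package main

import (
	"fmt"
	"time"
)

func main() {
	const total = 1000
	const batchSize = 100 // made-up number

	msgs := make(chan int, 4096) // the in-memory buffer between input and output

	// Input side: produces messages without waiting on the output's commits.
	go func() {
		for i := 0; i < total; i++ {
			msgs <- i
		}
		close(msgs)
	}()

	// Output side: accumulates a batch, then pays the per-commit cost once
	// per batch instead of once per message.
	start := time.Now()
	batch := make([]int, 0, batchSize)
	commits := 0
	flush := func() {
		if len(batch) == 0 {
			return
		}
		time.Sleep(time.Millisecond) // stand-in for per-commit/ack overhead
		commits++
		batch = batch[:0]
	}
	for m := range msgs {
		batch = append(batch, m)
		if len(batch) == batchSize {
			flush()
		}
	}
	flush() // commit any leftover partial batch
	fmt.Printf("%d messages in %d commits, took %v\n", total, commits, time.Since(start))
}
```

[With batchSize set to 1 the run above pays the commit cost a thousand times; at 100 it pays it ten times — the same amortization as committing every 1s instead of every 100ms.]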
[15:25:35] brouberol: yeah, tbh I feel like I don't fully understand the problem, this works but I don't really know why
[15:25:59] I mean, I don't know why the throughput was so low in the first place
[15:26:37] which takes me back to the lack of available instrumentation being an issue
[15:26:47] True
[15:27:27] It's parsing json, which I expect can be fairly slow, but it's not really doing anything other than that, and that's a bit too slow maybe? Hard to say without metrics, as you say
[15:28:54] I really don't think it's a CPU problem though, it's some kind of benthos internals problem, possibly also something to do with the network, and I don't know more than that
[15:28:57] My hunch is, it's spending quite some time GC-ing, as it's processing loads of small messages
[15:29:07] I could be completely wrong
[15:29:10] That could be a thing too, true
[15:29:17] GC metrics exist...
[15:29:39] I'll look into them next time I'm bored :-D
[15:29:47] this is usually when someone on reddit or HN will ask for a rust rewrite
[17:25:37] <_joe_> brouberol: usually go performance is killed by memory allocations, not even GC
[17:26:14] <_joe_> so performant go code tries to be zero-alloc or one-alloc, but ofc it's harder to do with a general-purpose tool like benthos...
[17:27:15] <_joe_> ... but at the same time we also process 200k msg/second with benthos in the webrequests pipeline
[17:27:32] <_joe_> so I guess some of the stuff we do in this pipeline specifically is the bottleneck
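[Editor's aside: a quick way to check _joe_'s point on any Go hot path is to count allocations directly. The sketch below is a generic illustration, not a measurement of Benthos: the JSON payload and struct are invented, and testing.AllocsPerRun is used because it works outside a test binary.]

```go
// Sketch: count per-call heap allocations on a JSON-parsing hot path.
// Decoding into a map allocates for the map plus every key/value; decoding
// into a reusable struct allocates far less. This per-message allocation
// cost is what tends to dominate before GC even enters the picture.
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// Invented payload, loosely shaped like an access-log message.
var payload = []byte(`{"host":"cp1114","status":200,"uri":"/wiki/Main_Page"}`)

func main() {
	// Allocations per call when decoding into a generic map.
	intoMap := testing.AllocsPerRun(1000, func() {
		var m map[string]any
		_ = json.Unmarshal(payload, &m)
	})

	// Allocations per call when decoding into a reusable typed struct
	// (string fields still allocate, but the map overhead disappears).
	type accessLog struct {
		Host   string `json:"host"`
		Status int    `json:"status"`
		URI    string `json:"uri"`
	}
	var rec accessLog
	intoStruct := testing.AllocsPerRun(1000, func() {
		_ = json.Unmarshal(payload, &rec)
	})

	fmt.Printf("allocs/op into map: %.0f, into struct: %.0f\n", intoMap, intoStruct)
}
```

[On the GC side, kamila_'s "GC metrics exist" is right in general: Go's runtime/metrics package and runtime.ReadMemStats expose GC pause and heap-allocation counters that an exporter can surface.]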
[17:31:53] _joe_: o/ you mentioned I should ask you about apache confs. https://phabricator.wikimedia.org/T353817#9933188
[17:33:51] <_joe_> ottomata: I guess you're ok with me taking a look by end of the week, given you're off for the rest of it?
[17:35:00] _joe_: for sure! thank you.
[17:35:19] > 200k msg/second with benthos in the webrequests pipeline
[17:35:19] do we? I think there are performance issues there too.
[17:35:19] https://phabricator.wikimedia.org/T360454
[17:35:19] https://phabricator.wikimedia.org/T364379
[17:36:45] <_joe_> ottomata: I meant the pipeline from webrequest to turnilo/superset live
[17:37:01] <_joe_> I am well aware of those issues :)
[17:38:06] oh right! nice.
[17:44:57] Is https://debmonitor.wikimedia.org/ returning status 500 responses a known issue?
[17:45:13] wfm
[17:45:34] also for me
[17:46:36] The response I get has "x-cache: cp1114 miss, cp1114 pass" and a 500 status :/
[17:46:40] I don't see 500s in the logs either (apart from idp1003, but it's a known issue)
[17:47:14] * volans fixing idp1003 given I'm at it ( elukey FYI )
[17:49:47] {done}
[17:51:16] bd808: are you still able to repro?
[17:52:09] volans: yes. I'm still getting the 500 response via cp1114 miss, cp1114 pass
[17:53:02] mmmh the only thing I got in the debmonitor backend logs for the last 5 minutes is 5 requests with: invalid request block size: 9174 (max 8192)...skip
[17:54:02] but not even sure if related to your request, or your request didn't make it to debmonitor at all
[17:54:58] I first get redirected to idp to provide auth, and then once I have a MOD_AUTH_CAS cookie I get the 500 error
[17:55:06] ahhh
[17:55:46] found them in the apache logs
[17:55:50] https://logstash.wikimedia.org/goto/843ba91df76538fda556a4c35a157bb1
[17:55:51] but nothing on the backend
[17:56:09] volans: yeah I was just about to say, Varnish reports a 500 with Apache backend
[17:57:03] apache doesn't like to redirect bd808 to the backend... maybe trying to logout and re-login into CAS might help? I'm not sure what's wrong here but it seems something related to authn/z
[17:58:00] and yes all the errors are for your user, so it seems something specific to your auth attempts
[17:58:22] bd808: I suggest nuking your cas cookie, at least for debmonitor
[17:58:29] I think I've had this happen to me before, a few years ago heh
[17:58:41] ¯\_(ツ)_/¯ I have recreated it in Firefox with 3 different sessions and Chrome once.
[17:59:43] it's no big deal for me personally today, but vexing I guess to not know why
[18:00:08] huh
[18:00:35] wanna attach gdb to an apache worker on debmonitor?
[18:02:41] soo, checking on disk I see 4 different cookies with Bryan's user
[18:02:48] I could also try to nuke them
[18:02:54] (copy them over for later debug)
[18:05:44] bd808: could you retry once more please?
[18:06:37] volans: same result here. maybe you see something new in the logs :)
[18:06:56] I just saw the new ticket created with as true instead of false
[18:07:50] we might need to ask Simon to have a look tomorrow
[18:08:45] I've moved the old tickets for bryan to /home/root/2024-07-03_bd808_cas_tickets for debugging purposes (to be deleted once done)
[18:09:39] * volans doing one last check into debmonitor's db
[18:14:37] no red flags so far, I'll open a task for Simon
[18:20:37] I've created T369205 and subscribed you, bd808. Sorry for the inconvenience. If you need some specific data right now let me know and I can get it for you ;)
[18:20:39] T369205: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205
[18:22:38] volans: thanks for the effort. I was honestly just going to poke around and see what the tool does these days :)
[18:22:53] pretty much what it did a few years back :D
[18:23:29] list packages installed and upgradable and kernels for all hosts and k8s images
[18:24:32] * volans dinner time