[08:20:47] cheers Southparkfan, not exactly out of the woods yet!
[08:23:27] Ah yeah, shared curve errors
[08:24:00] yes and no, a red herring I believe
[08:25:40] Is it a bug in rsyslog?
[08:27:17] I don't know yet for sure tbh
[08:27:25] ECC would be totally broken if you don't have common curves ;-)
[08:28:16] how broken? "can't set up the connection" level of broken?
[08:28:29] So unless rsyslog is falling back to, let's say, ciphersuites relying on non-ECC key exchange algorithms, I don't really see how you cannot have shared curves
[08:29:08] In most cases, yes
[08:29:59] Admittedly I don't know what ciphersuites are supported atm
[08:30:15] ok thank you, would no shared curves explain the behaviour in https://phabricator.wikimedia.org/T351710#9349879 ?
[08:30:33] i.e. hosts with low traffic have no issues, with high traffic do have issues
[08:30:42] traffic being syslog traffic
[08:34:08] Do the nodes with low traffic *never* emit these errors?
[08:36:03] not afaics
[08:41:48] * Southparkfan looks at the Cloud VPS nodes
[08:43:06] thank you
[08:44:20] I'm testing a rollback to gnutls for the centrallog servers only
[08:44:43] at least I see the same errors on Cloud VPS..
[08:45:17] heheh good in some ways
[08:45:23] is the new puppet leaf cert signed by an intermediate instead of directly by the root CA?
[08:45:37] that's my understanding yes
[08:45:38] because gnutls would fail miserably :D
[08:45:57] unless the gnutls issue was a verification issue client-side
[08:46:16] but I recall having server-side issues with sending the appropriate chain
[08:46:32] is this the issue you had in mind?
https://phabricator.wikimedia.org/T351181
[08:46:46] in that case we did openssl -> gnutls and things seemed to work
[08:49:28] so maybe the client is not as broken as the server is
[08:50:07] I'm also thinking of using this opportunity to move rsyslog receiver to a separate rsyslog process/service
[08:50:40] can you find out if rsyslog serves a different chain when using gnutls compared to openssl? https://phabricator.wikimedia.org/T324623#8449240
[08:50:55] you can leave the -CAfile out
[08:51:35] sure I'll test that
[08:52:49] err not now that is
[08:53:45] yeah I have to go now
[08:54:03] but I've started a tcpdump on the syslog server
[08:54:09] thank you for your help so far Southparkfan !
[08:54:14] no problem
[08:54:45] I like tinkering with tls and networking ;-p
[08:56:52] respect
[09:32:36] Hi team! I'd like to have a script that runs periodically for each kafka cluster, to export the replication factor of each topic to prometheus. Is there a specific host that feels right to run this on? Thanks!
[09:44:33] brouberol: would kafka_cluster_Partition_ReplicasCount do ?
[09:45:13] I'm aware that doesn't answer your question though :)
[09:47:25] ah, I think it would actually!
[09:47:35] which would make my life even easier
[09:48:06] sweet! thanks for reaching out
[09:53:37] https://thanos.wikimedia.org/graph?g0.expr=max+by+%28topic%29+%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%2C+topic%3D%22webrequest_text%22%7D%29&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D does exactly what I need
[09:53:42] thanks for saving me time!
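(editor's note) The Thanos link at 09:53:37 carries its PromQL expression percent-encoded in the `g0.expr` query parameter. A minimal sketch of recovering the readable query with Python's standard library follows; the helper name `promql_from_graph_url` and the `panel` argument are hypothetical, introduced only for illustration, and the URL is copied verbatim from the log.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied verbatim from the 09:53:37 message in the log.
THANOS_URL = (
    "https://thanos.wikimedia.org/graph?g0.expr=max+by+%28topic%29+%28"
    "kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%2C"
    "+topic%3D%22webrequest_text%22%7D%29&g0.tab=0&g0.stacked=0"
    "&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1"
    "&g0.partial_response=0&g0.store_matches=%5B%5D"
)


def promql_from_graph_url(url: str, panel: str = "g0") -> str:
    """Return the decoded PromQL expression of one graph panel.

    Hypothetical helper: it only assumes the expression lives in the
    '<panel>.expr' query parameter, as it does in the URL above.
    """
    params = parse_qs(urlsplit(url).query)  # decodes '+' and %XX escapes
    return params[f"{panel}.expr"][0]


if __name__ == "__main__":
    print(promql_from_graph_url(THANOS_URL))
    # max by (topic) (kafka_cluster_Partition_ReplicasCount{cluster="kafka_jumbo", topic="webrequest_text"})
```

Read this way, the existing metric already reports the replica count per topic, which is why no separate export script was needed.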
[10:15:09] amazing
[11:23:14] hi all (cc godog ) im going to deploy a change to blackbox::check::* ssl certs
[11:23:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/976273
[11:23:28] ill start by disabling puppet on all prometheus hosts
[11:23:46] then deploy to prometheus1005.eqiad.wmnet prometheus1006.eqiad.wmnet
[11:24:04] *just 1005
[11:25:49] ack jbond, thanks for the heads up
[11:25:53] I'm going to lunch shortly fwiw
[11:26:02] godog: ack
[11:32:03] godog: I haven't been able to dump the information I need, but it looks like my syslog server is broken as well, so is my puppetmaster
[11:32:24] Southparkfan: :( bummer
[11:32:49] do you have a Cloud VPS project with local puppetmaster I can (temporarily?) use?
[11:33:31] or gather the data in production, with some precautions to not disclose any actual confidential data :)
[11:34:02] arguably the latter is not preferred, because rsyslog will be restarted, hence impacting the syslog stream
[11:34:19] or... wait a minute..
[11:34:44] yeah I'd rather do the former i.e. cloudvps, I don't have a local puppetmaster free for use atm
[11:34:53] ok need to run to lunch, bbiab
[11:35:26] ah I think I've found the issue :P
[11:39:17] godog: fyi i have reverted that change as it didn't fix the issue. i think it might still be useful to do but i also needed to add an additional fix
[11:47:24] godog: (after lunch) bookworm rsyslog client reports ten (10) supported curves/finite-field groups in its ClientHello, bullseye server replies back with a shared curve group in its ServerHello
[11:50:31] but I'd defer to Valentín if/when we are going to dig into rsyslog source code and OpenSSL library functions
[11:51:39] do we notice a correlation based on OS (e.g. pairs of bullseye clients and bookworm servers?) or function (only ms-fe?)
[11:52:05] Southparkfan: I guess that's expected behavior...
the server picks the preferred curve from the ones available to the client (or refuses to complete the TLS handshake)
[11:52:26] also need to take into account how much time to spend here vs. actually adding config that adds secure defaults
[11:52:59] adding config providing a solid TLS configuration for rsyslog isn't optional IMHO
[11:53:10] vgutierrez: yeah, so it does work as expected in the average case, but my Cloud VPS syslog server also reports errors
[11:53:13] but I'm the TLS crazy guy
[11:53:28] but doesn't provide much information on which client it failed for
[11:53:42] the 'vs.' does indicate mutual exclusion :-)
[11:54:22] I absolutely agree secure defaults are needed, I was wondering if we should aim straight for adding the defaults or if we should debug further first
[11:58:15] a quick check, openssl ecparam -list_curves | sort |md5sum yields the same output on bookworm and bullseye
[12:00:08] That makes sense
[12:03:52] Have to go now, but unless rsyslog empties the list of supported curves client-side...
[12:04:25] (why would you)
[12:09:49] https://github.com/rsyslog/rsyslog/blob/ef1bd6bc03849a22110a6e54abbdb9573bf7b1b6/runtime/nsd_ossl.c#L1487-L1501
[12:10:30] OPENSSL_VERSION_NUMBER >= 0x10002000L is 1.0.2, so SSL_CTX_set_ecdh_auto is being used
[12:11:47] the error comes from osslPostHandshakeCheck: https://github.com/rsyslog/rsyslog/blob/ef1bd6bc03849a22110a6e54abbdb9573bf7b1b6/runtime/nsd_ossl.c#L1615
[12:15:05] on OpenSSL 3 manpage... SSL_CTX_set_ecdh_auto() and SSL_set_ecdh_auto() are deprecated and have no effect.
[12:15:20] it looks like rsylog could use a patch
[12:15:25] *rsyslog
[12:20:33] so no specific curve selection is happening for OpenSSL 3 (rsyslog for bookworm)
[12:20:52] enforcing it via configuration should get rid of that
[12:30:23] vgutierrez: nice find. What about bullseye? Is openssl 1.1.xy still the default there?
[12:31:20] Yeah..
no openssl 3
[12:31:53] 1.1.1w at the moment
[12:33:25] Aha
[12:33:45] If we cannot reproduce the errors on bullseye->bullseye, this must be the cause
[12:34:28] It's... interesting
[12:39:59] (PuppetFailure) firing: Puppet has failed on prometheus1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:48:48] jbond: ack thank you
[12:49:05] Southparkfan: ack too, thank you and vgutierrez for the investigation so far!
[14:30:45] https://github.com/rsyslog/rsyslog/blame/ef1bd6bc03849a22110a6e54abbdb9573bf7b1b6/runtime/nsd_ossl.c#L1487-L1501 SSL_CTX_set_ecdh_auto is used if openssl >= 1.0.2 and <= 1.1.1o, SSL_CTX_set_tmp_ecdh is used if openssl < 1.0.2
[14:34:00] buster ships with 1.1.1n, bullseye with 1.1.1w and bookworm with 3.0.11. so the latter two don't use SSL_CTX_set_ecdh_auto, on 3.0.11 the function is deprecated regardless, but I'm not sure what happens if you remove the call on 1.1.1w
[14:34:26] godog: (cc herron ) re the change set to move rsyslog to pki
[14:34:38] I recall seeing these errors on bullseye clients as well...
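(editor's note) The version gate quoted at 12:10:30 compares against `0x10002000L`, which is an encoded OpenSSL version number. A minimal sketch of decoding such values follows, assuming the layouts OpenSSL documents for `OPENSSL_VERSION_NUMBER`: `0xMNNFFPPS` for 1.x releases (with the patch level shown as a letter) and `0xMNN00PP0` for 3.x. The function name `decode_openssl_version` is hypothetical, and the sample constants are derived from the version strings mentioned in the log rather than read from any header.

```python
def decode_openssl_version(number: int) -> str:
    """Decode an OPENSSL_VERSION_NUMBER-style integer into a version string.

    Assumed layouts (hypothetical helper, not part of rsyslog or OpenSSL):
      1.x: 0xMNNFFPPS  -> major.minor.fix + patch letter (1 = 'a', 2 = 'b', ...)
      3.x: 0xMNN00PP0  -> major.minor.patch
    Patch levels past 'z' (used by some very old releases) are not handled.
    """
    major = (number >> 28) & 0xF
    minor = (number >> 20) & 0xFF
    patch = (number >> 4) & 0xFF
    if major >= 3:
        return f"{major}.{minor}.{patch}"
    fix = (number >> 12) & 0xFF
    letter = chr(ord("a") + patch - 1) if patch else ""
    return f"{major}.{minor}.{fix}{letter}"


if __name__ == "__main__":
    # 0x10002000 is the threshold from the log; the other two values are the
    # encodings that correspond to the 1.1.1w and 3.0.11 releases named there.
    for value in (0x10002000, 0x1010117F, 0x300000B0):
        print(hex(value), "->", decode_openssl_version(value))
    # 0x10002000 -> 1.0.2
    # 0x1010117f -> 1.1.1w
    # 0x300000b0 -> 3.0.11
```

Because the encoded integers sort in release order, a plain `>=` comparison against `0x10002000` is true for both 1.1.1w and 3.0.11; any upper bound has to be a separate condition in the source.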
[14:35:22] i think i should plan a time when either one or both of you are around to deploy this change (alternatively you deploy while im around)
[14:35:52] it would be good to get done today/tomorrow but let me know if thats too tight a timeline
[14:36:35] jbond: thank you for the heads up, early next week works best for me tbh
[14:36:50] which will mean herro.n is around too, added bonus
[14:37:31] * jbond checks to make sure jobo is not here
[14:38:10] godog: ack that works, im not strictly meant to be doing changes next week but i dont see that lasting so lets aim for monday/tuesday at a time that is good for herron
[14:38:11] Southparkfan: quite interesting, can't say I understand what's going on heh, or what is supposed to
[14:38:22] jbond: SGTM, thank you
[14:38:41] ill send out an invite but feel free to suggest alternates
[14:38:49] no worries godog, I've never used the library either :)
[14:40:34] heheh
[14:41:02] * Southparkfan still has nightmares regarding his exam questions on Galois fields' purpose in TLS
[14:41:09] jbond: I'm doing some work in an adjacent area btw, and I've sent some patches your way, there shouldn't be clashes though with your work and I'm happy to hold off too
[14:44:34] Southparkfan: makes you want to run all across those fields and far away
[14:44:39] that's their purpose
[14:44:43] xD
[14:46:11] godog: ack ill take a proper look in a bit but from a scan i agree they do not conflict
[14:47:45] cheers, I'm about to jump into meetings for a couple of hours
[14:48:58] using two bullseye systems (one client, one server, absolutely no other remote dests configured on either system) is also enough to emit the error
[14:57:00] SSL_get_shared_curve itself is also legacy, ha
[15:14:25] ok, I have no time anymore to dig further, but the pcaps don't match the errors, and it's not an openssl 3 (bookworm)-only issue
[15:37:05] Southparkfan: SSL_get_negotiated_group() is the new one
[16:40:14] (PuppetFailure) firing: Puppet has failed on 
prometheus1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:40:14] (PuppetFailure) firing: Puppet has failed on prometheus1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:43:20] ^Looking
[20:49:19] The alert is expected since Puppet is disabled on that host. Acknowledging it to cut down on notifications. More details at https://gerrit.wikimedia.org/r/c/operations/puppet/+/976273.