[10:48:55] godog: do you have some time to talk about C:prometheus::blackbox::modules::service_catalog
[10:53:38] jbond: sure, got about 40 mins before lunch
[10:56:50] godog: cool, so i'm trying to make pki active-active https://gerrit.wikimedia.org/r/c/operations/puppet/+/895758
[10:57:17] one of the problems is that the service uses client auth and the above resource doesn't
[10:57:31] i could add that, however it seems overkill for this service (but say if not and i'll just add it)
[10:58:09] so instead i thought i would abuse the fact that the prom servers can reach the service over port 80 for the metrics endpoint
[10:58:37] as such i thought i would set up the service port to go there. however when i look at the output i'm not sure if it will actually check port 80
[10:58:51] output is here https://puppet-compiler.wmflabs.org/output/895758/40130/prometheus1005.eqiad.wmnet/index.html
[10:59:36] jbond: yeah you are right, it'll probe the 'port' attribute from the catalog
[10:59:51] I agree client auth seems overkill
[11:00:03] is that the only thing the port attribute is used for, i.e. can i just update it to use port 80?
[11:00:12] this is not an LVS service
[11:01:59] I think in this case setting port: 80 would work yeah, I'm not aware of any other use
[11:02:21] ack cool, i'll go with that for now unless you have a better suggestion?
[11:02:41] another solution (which I don't know how feasible it is to do) might be to exclude /metrics from client auth
[11:03:09] which would also allow prometheus to fetch metrics over https with no further configs
[11:03:38] s/no further configs/require client auth in prometheus config/
[11:04:50] godog: i had problems doing that, as client auth on specific URIs requires different SSL semantics than doing it for the whole site (where it's just established with the connection).
[11:05:21] i have a feeling that the cfssl binary couldn't deal with it/didn't support it but details are hazy now
[11:05:40] that said i'll do a quick test to confirm
[11:06:00] ok got it, thank you, a quick test SGTM
[11:06:08] ack thanks
[11:28:25] gotta go to lunch, ttyl
[12:13:54] godog: fyi i looked into the mutual auth thing and it seems my memory was fairly close. i created a task here https://phabricator.wikimedia.org/T332149 however i'm not sure it's worth the effort. but please feel free to comment and disagree :)
[12:55:55] jbond: nice, thank you for taking a look! do you mind if I try sth on pki1001 ? I saw you were testing there
[13:06:16] godog: yes please do, you can use /home/jbond/cfssl-test to try and request a certificate signing
[13:12:17] jbond: thank you, will do
[13:26:32] jbond: yeah I see the problem now, bummer :(
[13:27:18] yes indeed it's a PITA
[13:27:38] +1'd the patch in the meantime, I left pki1001 alone
[13:27:54] ack, thanks i'll re-enable puppet now
[19:53:29] yo herron, looks like we'll need something to stem kafka lag closer to the producers today :D
[19:55:14] cwhite hmmmm fd
[19:55:31] s/fd//
[19:57:59] Seems codfw has half the throughput capacity of eqiad. I wonder if it's a limitation of our intra-dc links?
[20:00:22] I'm also in the process of rolling eqiad kafka-logging distro upgrades today, could be related
[20:02:26] herron: https://gerrit.wikimedia.org/r/c/operations/puppet/+/898916
[20:06:51] ah I'm with you now, +1'd
[20:07:22] tried it locally on a host, seems rsyslog doesn't like the \ character :/
[20:07:48] hmm awful escaping rings a bell
[20:11:09] still doesn't like it :/
[20:15:05] ok, maybe this will work. checks out locally
[20:15:56] * cwhite presses thumbs
[20:18:41] cwhite: yeah lgtm too, loads up on a test host
[20:29:02] cwhite: fwiw \\ at least does not break the config, not sure off hand if that matches as expected.
also I am thinking we'll want to move the conditional inside the udp_localhost_to_kafka ruleset
[20:30:35] yeah, it definitely isn't matching
[20:31:05] is there a host you are using for testing?
[20:31:25] mw1349
[20:32:39] alright if I try a change on 50-udp-localhost-compat.conf ?
[20:32:55] please do :)
[20:34:46] ok, I see events in /tmp/rdbms.log now
[20:35:49] sure enough, let's expand it a bit
[20:36:46] that works too. pushing a patch
[20:37:18] kk
[20:38:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/898917
[20:47:01] starting to see overall log throughput drop in eqiad
[20:47:21] 😅
[20:47:29] waiting for the 📉
[20:50:26] How much do you think the kafka-logging maintenance affected codfw's ingest capacity?
[21:01:31] at first I wanted to rule it out because I saw broker disconnects in the logstash logs, but those look normal and reconnected successfully. capacity wise I'd expect it to be negligible esp since eqiad isn't lagging, fwiw we perform one-node-at-a-time rolling maintenance probably once a quarter
[21:01:55] it's interesting how codfw is lagging though. looks like the graphs are starting to level off
[21:06:01] made a task to investigate that: https://phabricator.wikimedia.org/T332225
[21:06:17] nice, thank you
[21:06:36] lots of things could be at play there :/
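
Note on the backslash exchange (20:07 to 20:30): a minimal sketch of one common way a doubled backslash can load cleanly yet never match. If a config layer passes `\\` through to the filter without unescaping it, the filter looks for two backslashes where the message contains one. Everything here is a hypothetical stand-in: the message text, the `matches` helper and the plain substring check are illustrative only. This is not rsyslog's actual parsing behaviour and not a diagnosis of the patch.

```python
# Hypothetical log line containing ONE literal backslash
# (written as "\\" in Python source).
MESSAGE = "channel=Some\\Namespace level=info"


def matches(needle: str, message: str = MESSAGE) -> bool:
    """Plain substring test, standing in for a 'contains'-style filter."""
    return needle in message


# Needle with one literal backslash: what the filter should end up seeing.
single = "Some\\Namespace"

# Needle with two literal backslashes: what the filter would see if the
# config layer kept "\\" as-is instead of unescaping it (an assumption).
double = "Some\\\\Namespace"

if __name__ == "__main__":
    print("single backslash matches:", matches(single))
    print("double backslash matches:", matches(double))
```

Testing against a captured message on one host, as was done on mw1349, is the reliable way to tell which form the filter actually receives.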