[01:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:44] I've updated T357333 to note that we're back to email & IRC pings every 4 hours again
[07:29:45] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[09:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:18] so what is up with that host?
[09:07:55] https://phabricator.wikimedia.org/T362311
[09:08:12] should be decommissioned in 30 minutes or so
[09:08:38] ah excellent
[09:10:00] I am waiting to confirm this is gone: https://alerts.wikimedia.org/?q=alertname%3Dsnapshot%20of%20s5%20in%20codfw&q=team%3Dsre&q=%40receiver%3Dirc-spam
[09:14:14] jynus: btw you might get an alert of s4 eqiad backups being smaller as I deployed a schema change to categorylinks which has reduced the table quite a bit
[09:14:30] interesting, thanks
[09:14:54] I've not seen it so far despite last backups finishing 3h ago
[09:15:20] or maybe it will only be on dumps?
[11:49:12] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:52] merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019733 to fix ^
[12:06:07] I am going to deploy that new grant to m1 primary
[15:02:09] Emperor: are the thanos-be disk alerts... known?
[15:03:05] urandom: o/ cassandra-dev on pki :
[15:03:06] :)
[15:03:43] I saw! \o/
[15:04:10] so... we should probably follow this up with some testing of kask
[15:04:16] I think that now we could try this - if we flip the enablehostver flag we should still get an error, since the cabundle is not right yet
[15:04:27] to double check that also the cert is verified
[15:04:51] ah yes and see if kask works now
[15:05:25] right, I was just going to say: 0) verify it currently works, 1) enable verification, and expect failure. 2) fix the cert chain, expect success ??
[15:07:45] (0) checks out btw
[15:12:33] elukey: what is the right way to do (2) above?
[15:13:06] urandom: in theory simply changing the cabundle with the one brought by wmf-certificates
[15:13:19] (we should check if the kask docker images have it)
[15:14:10] urandom: disk-nearly-full alerts?
[15:14:16] right, I meant, are we adding that to main_app.certs.cassandra.ca, or chaning the path?
[15:14:19] Emperor: yes.
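For readers following the kask/Cassandra thread above: "flip the enablehostver flag" and "change the cabundle" map onto the TLS options of gocql, the Cassandra driver kask uses. The sketch below is illustrative only, assuming a plain gocql client; the contact point is a hypothetical placeholder and these are not kask's actual configuration keys.

```go
// Illustrative sketch only (not kask's real config surface): what
// "enable host verification" plus "point the cabundle at wmf-certificates"
// means at the gocql level.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; the real cassandra-dev hosts differ.
	cluster := gocql.NewCluster("cassandra-dev-host.example")
	cluster.SslOpts = &gocql.SslOptions{
		// Step (2): trust the bundle shipped by the wmf-certificates package.
		CaPath: "/etc/ssl/certs/wmf-ca-certificates.crt",
		// Step (1): verify the server certificate and hostname.
		EnableHostVerification: true,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		// With verification on but a stale/wrong bundle, this is where
		// "x509: certificate signed by unknown authority" surfaces.
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
	log.Println("session established with host verification enabled")
}
```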
[15:14:42] s/chaning/chainging/
[15:15:28] urandom: yeah, that's T351927
[15:15:29] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[15:16:07] urandom: I'd say changing the path directly, seems easier, we don't need to support the old CA anymore
[15:18:17] gotcha, so once the migration is complete, we can refactor the chart to remove all of that
[15:18:31] yes yes exactly
[15:22:12] arnaudb: db2170 is not multiinstance but in zarcillo it's marked as such I think
[15:22:26] I think there are a couple more hosts like this too
[15:22:31] oh
[15:22:41] db2167
[15:27:03] seems like an artifact, will clean
[15:39:51] I think db2171:3315 and db2137:3315 are similar too
[15:40:17] db2101:3315
[15:41:09] Amir1: https://phabricator.wikimedia.org/T362311
[15:41:35] I think you are returning hosts that fail to connect
[15:41:43] which can be for several reasons
[15:41:56] yeah, I'm looking for multiinstance ones
[15:43:28] elukey: `{"msg":"Error connecting to Cassandra: gocql: unable to create session: unable to discover protocol version: x509: certificate signed by unknown authority","appname":"sessionstore","time":"2024-04-15T15:42:16Z","level":"FATAL"}`
[15:43:42] (sorry for the wrong channel ping btw)
[15:44:27] urandom: perfect!
[15:45:03] I checked and wmf-certificates needs to be added to kask's docker image
[15:45:17] but so far I didn't manage to send a merge request to the kask repo in gitlab :D
[15:45:46] as-in there is a problem, or you just haven't gotten there yet?
[15:46:24] the latter :D
[15:46:27] my commit is https://gitlab.wikimedia.org/elukey/kask/-/compare/main...main?from_project_id=1312
[15:46:31] wmf-certificates the debian package, right?
[15:46:33] tested locally and the image builds
[15:46:40] right exactly
[15:47:02] Amir1: I'm sorry I don't see it for db2167, and db2170 on `instances` db2101 has been fixed, will move on to db2171 and db2137
[15:47:30] oh cool, probably something else
[15:47:44] trying to find a reference to `port` in other tables
[15:48:24] urandom: what is the right process? Commit in the fork, and open a merge request?
[15:48:31] otherwise feel free to do it
[15:49:26] but db2101 is a multiinstance host
[15:49:48] arnaudb ^
[15:50:03] and it is not a core host
[15:50:11] 2171*
[15:50:21] oh wait
[15:50:31] let me revert
[15:51:11] Amir1: you are confusing people
[15:52:05] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:40] I think people.discovery.wmnet is serving the old file
[15:52:47] let me check again
[15:52:48] db2170
[15:53:10] elukey: I pushed it to main from your branch, kind of surprised it didn't kick off the pipeline tho
[15:53:27] presumably it will when I push a tag for it
[15:53:30] db2101 was always a multiinstance host, backup source
[15:54:11] I always use the version tagged result when updating the chart though
[15:54:12] yeah, I was saying, it wasn't responsive, we needed to check
[15:55:32] urandom: totally ignorant about gitlab :(
[15:55:41] elukey: mostly so myself
[15:55:57] especially when it comes to this sort of thing
[15:58:23] and yeah, that started the pipeline: https://gitlab.wikimedia.org/repos/mediawiki/services/kask/-/pipelines/48791
[16:12:55] super
[16:47:40] elukey: to confirm, /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt is the right path?
[16:48:18] urandom: this one is better /etc/ssl/certs/wmf-ca-certificates.crt
[16:48:34] it contains the bundle, and it gets regenerated etc..
[16:49:02] perfect, thanks
[16:50:07] (going afk, will check later in case, have a nice rest of the day folks!)
[16:51:03] elukey: yup, thanks and enjoy the remainder of your day
[16:52:15] jynus: arnaudb the reason everything was confusing was that people has not been failed over to eqiad, I uploaded the file to codfw and the results were updated finally
[16:52:29] https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[16:52:39] (the documentation is lying)
[16:52:52] weird
[19:52:05] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service on db2199:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:59:12] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service on db2199:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
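As a footnote to the 16:47–16:48 exchange about CA paths: one way to sanity-check that a node's certificate chains to /etc/ssl/certs/wmf-ca-certificates.crt, independent of kask, is a bare TLS dial with that bundle as the only trusted root. This is a hypothetical standalone check, not part of kask; the hostname is a placeholder and the native-protocol port 9042 is an assumption.

```go
// Hypothetical standalone check (not part of kask): confirm that the
// certificate presented by a Cassandra node chains to the wmf-certificates
// bundle. Hostname is a placeholder; port 9042 (native protocol) is assumed.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	pem, err := os.ReadFile("/etc/ssl/certs/wmf-ca-certificates.crt")
	if err != nil {
		log.Fatalf("reading CA bundle: %v", err)
	}

	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		log.Fatal("no certificates parsed from the bundle")
	}

	// Verification uses only the bundle above, mirroring what kask does
	// once its cabundle path points at wmf-ca-certificates.crt.
	conn, err := tls.Dial("tcp", "cassandra-dev-host.example:9042", &tls.Config{RootCAs: pool})
	if err != nil {
		// A wrong or stale bundle shows up here as
		// "x509: certificate signed by unknown authority".
		log.Fatalf("TLS handshake failed: %v", err)
	}
	defer conn.Close()
	log.Printf("chain verified; peer subject: %s", conn.ConnectionState().PeerCertificates[0].Subject)
}
```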