[01:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:44] I've updated T357333 to note that we're back to email & IRC pings every 4 hours again
[07:29:45] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[09:07:05] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:18] so what is up with that host?
[09:07:55] https://phabricator.wikimedia.org/T362311
[09:08:12] should be decommissioned in 30 minutes or so
[09:08:38] ah excellent
[09:10:00] I am waiting to confirm this is gone: https://alerts.wikimedia.org/?q=alertname%3Dsnapshot%20of%20s5%20in%20codfw&q=team%3Dsre&q=%40receiver%3Dirc-spam
[09:14:14] jynus: btw you might get an alert of s4 eqiad backups being smaller as I deployed a schema change to categorylinks which has reduced the table quite a bit
[09:14:30] interesting, thanks
[09:14:54] I've not seen it so far despite last backups finishing 3h ago
[09:15:20] or maybe it will only be on dumps?
[11:49:12] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:52] merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019733 to fix ^
[12:06:07] I am going to deploy that new grant to m1 primary
[15:02:09] Emperor: are the thanos-be disk alerts... known?
[15:03:05] urandom: o/ cassandra-dev on pki :
[15:03:06] :)
[15:03:43] I saw! \o/
[15:04:10] so... we should probably follow this up with some testing of kask
[15:04:16] I think that now we could try this - if we flip the enablehostver flag we should still get an error, since the cabundle is not right yet
[15:04:27] to double check that also the cert is verified
[15:04:51] ah yes and see if kask works now
[15:05:25] right, I was just going to say: 0) verify it currently works, 1) enable verification, and expect failure. 2) fix the cert chain, expect success ??
[15:07:45] (0) checks out btw
[15:12:33] elukey: what is the right way to do (2) above?
[15:13:06] urandom: in theory simply changing the cabundle with the one brought by wmf-certificates
[15:13:19] (we should check if the kask docker images have it)
[15:14:10] urandom: disk-nearly-full alerts?
[15:14:16] right, I meant, are we adding that to main_app.certs.cassandra.ca, or chaning the path?
[15:14:19] Emperor: yes.
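For readers following the kask/Cassandra thread above: "flip the enablehostver flag" and "change the cabundle" map onto the TLS options of gocql, the Cassandra driver kask uses. The sketch below is illustrative only, assuming a plain gocql client; the contact point is a hypothetical placeholder and these are not kask's actual configuration keys.

```go
// Illustrative sketch only (not kask's real config surface): what
// "enable host verification" plus "point the cabundle at wmf-certificates"
// means at the gocql level.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; the real cassandra-dev hosts differ.
	cluster := gocql.NewCluster("cassandra-dev-host.example")
	cluster.SslOpts = &gocql.SslOptions{
		// Step (2): trust the bundle shipped by the wmf-certificates package.
		CaPath: "/etc/ssl/certs/wmf-ca-certificates.crt",
		// Step (1): verify the server certificate and hostname.
		EnableHostVerification: true,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		// With verification on but a stale/wrong bundle, this is where
		// "x509: certificate signed by unknown authority" surfaces.
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
	log.Println("session established with host verification enabled")
}
```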
[15:14:42] s/chaning/chainging/
[15:15:28] urandom: yeah, that's T351927
[15:15:29] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[15:16:07] urandom: I'd say changing the path directly, seems easier, we don't need to support the old CA anymore
[15:18:17] gotcha, so once the migration is complete, we can refactor the chart to remove all of that
[15:18:31] yes yes exactly
[15:22:12] arnaudb: db2170 is not multiinstance but in zarcillo it's marked as such I think
[15:22:26] I think there are a couple more hosts like this too
[15:22:31] oh
[15:22:41] db2167
[15:27:03] seems like an artifact, will clean
[15:39:51] I think db2171:3315 and db2137:3315 are similar too
[15:40:17] db2101:3315
[15:41:09] Amir1: https://phabricator.wikimedia.org/T362311
[15:41:35] I think you are returning hosts that fail to connect
[15:41:43] which can be for several reasons
[15:41:56] yeah, I'm looking for multiinstance ones
[15:43:28] elukey: `{"msg":"Error connecting to Cassandra: gocql: unable to create session: unable to discover protocol version: x509: certificate signed by unknown authority","appname":"sessionstore","time":"2024-04-15T15:42:16Z","level":"FATAL"}`
[15:43:42] (sorry for the wrong channel ping btw)
[15:44:27] urandom: perfect!
[15:45:03] I checked and wmf-certificates needs to be added to kask's docker image
[15:45:17] but so far I didn't manage to send a merge request to the kask repo in gitlab :D
[15:45:46] as-in there is a problem, or you just haven't gotten there yet?
[15:46:24] the latter :D
[15:46:27] my commit is https://gitlab.wikimedia.org/elukey/kask/-/compare/main...main?from_project_id=1312
[15:46:31] wmf-certificates the debian package, right?
[15:46:33] tested locally and the image builds
[15:46:40] right exactly
[15:47:02] Amir1: I'm sorry I don't see it for db2167, and db2170 on `instances` db2101 has been fixed, will move on to db2171 and db2137
[15:47:30] oh cool, probably something else
[15:47:44] trying to find a reference to `port` in other tables
[15:48:24] urandom: what is the right process? Commit in the fork, and open a merge request?
[15:48:31] otherwise feel free to do it
[15:49:26] but db2101 is a multiinstance host
[15:49:48] arnaudb ^
[15:50:03] and it is not a core host
[15:50:11] 2171*
[15:50:21] oh wait
[15:50:31] let me revert
[15:51:11] Amir1: you are confusing people
[15:52:05] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:40] I think people.discovery.wmnet is serving the old file
[15:52:47] let me check again
[15:52:48] db2170
[15:53:10] elukey: I pushed it to main from your branch, kind of surprised it didn't kick off the pipeline tho
[15:53:27] presumably it will when I push a tag for it
[15:53:30] db2101 was always a multiinstance host, backup source
[15:54:11] I always use the version tagged result when updating the chart though
[15:54:12] yeah, I was saying, it wasn't responsive, we needed to check
[15:55:32] urandom: totally ignorant about gitlab :(
[15:55:41] elukey: mostly so myself
[15:55:57] especially when it comes to this sort of thing
[15:58:23] and yeah, that started the pipeline: https://gitlab.wikimedia.org/repos/mediawiki/services/kask/-/pipelines/48791
[16:12:55] super
[16:47:40] elukey: to confirm, /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt is the right path?
[16:48:18] urandom: this one is better /etc/ssl/certs/wmf-ca-certificates.crt
[16:48:34] it contains the bundle, and it gets regenerated etc..
[16:49:02] perfect, thanks
[16:50:07] (going afk, will check later in case, have a nice rest of the day folks!)
[16:51:03] elukey: yup, thanks and enjoy the remainder of your day
[16:52:15] jynus: arnaudb the reason everything was confusing was that people has not been failed over to eqiad, I uploaded the file to codfw and the results were updated finally
[16:52:29] https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[16:52:39] (the documentation is lying)
[16:52:52] weird
[19:52:05] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service on db2199:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:59:12] (SystemdUnitFailed) firing: (2) prometheus-mysqld-exporter.service on db2199:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
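As a footnote to the 16:47–16:48 exchange about CA paths: one way to sanity-check that a node's certificate chains to /etc/ssl/certs/wmf-ca-certificates.crt, independent of kask, is a bare TLS dial with that bundle as the only trusted root. This is a hypothetical standalone check, not part of kask; the hostname is a placeholder and the native-protocol port 9042 is an assumption.

```go
// Hypothetical standalone check (not part of kask): confirm that the
// certificate presented by a Cassandra node chains to the wmf-certificates
// bundle. Hostname is a placeholder; port 9042 (native protocol) is assumed.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	pem, err := os.ReadFile("/etc/ssl/certs/wmf-ca-certificates.crt")
	if err != nil {
		log.Fatalf("reading CA bundle: %v", err)
	}

	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		log.Fatal("no certificates parsed from the bundle")
	}

	// Verification uses only the bundle above, mirroring what kask does
	// once its cabundle path points at wmf-ca-certificates.crt.
	conn, err := tls.Dial("tcp", "cassandra-dev-host.example:9042", &tls.Config{RootCAs: pool})
	if err != nil {
		// A wrong or stale bundle shows up here as
		// "x509: certificate signed by unknown authority".
		log.Fatalf("TLS handshake failed: %v", err)
	}
	defer conn.Close()
	log.Printf("chain verified; peer subject: %s", conn.ConnectionState().PeerCertificates[0].Subject)
}
```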