[07:50:52] what is going on with s1 codfw?
[08:15:14] I was performing a switchover
[08:15:24] ah ok
[08:27:04] (SystemdUnitFailed) firing: (12) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:04] (SystemdUnitFailed) firing: (12) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:04] (SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:04] (SystemdUnitFailed) firing: (10) prometheus-mysqld-exporter.service on db2198:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:05] I have no idea where those errors come from; if I check for failed systemd units it shows 0
[09:11:55] it looks like a confirmation that T357333 is needed haha
[09:11:56] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[09:12:24] jynus: they were renamed with @s7 and @s8
[09:12:39] $ journalctl -u prometheus-mysqld-exporter.service | tail -n1
[09:12:42] Mar 05 15:41:26 db2198 systemd[1]: Failed to start prometheus-mysqld-exporter.service - Prometheus exporter for MySQL server.
[09:13:25] so I guess there is something in our puppetization that doesn't properly clean up everything when migrating prometheus-mysqld-exporter from single-instance to multi-instance
[09:14:11] volans: it is not that; they failed at some point, but they are not failing currently
[09:14:20] they don't exist currently
[09:14:22] so why do they fire?
[09:15:47] there is no prometheus-mysqld-exporter.service unit right now AFAICT
[09:16:21] so why is the monitoring saying "SystemdUnitFailed"?
[09:16:46] is it checking the logs rather than the current state?
[09:16:55] that's based on prometheus metrics
[09:17:23] do you happen to know which exporter or service?
[09:17:53] mmmh I don't see them in alerts.w.o
[09:18:48] weird, I saw them just a few seconds ago
[09:18:58] which was what was confusing me
[09:19:48] sorry, dunno, alertmanager is still too confusing for me
[09:20:28] I think we need to do better systemd monitoring and rethink the whole systemd restart setup
[09:20:48] (wmf_auto_restart)
[09:25:07] for now, alertname!=SystemdUnitFailed on alertmanager 🙈
[12:01:19] mysqld config generator has a runtime issue, hotfixing
[12:07:05] if somebody is up for a quick +1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017456?tab=checks :)
[12:08:57] thanks kormat !
[12:09:06] arnaudb: my pleasure :)
[12:09:37] that's probably about the level of complexity i can handle right now; i need to remember how to puppet etc :)
[12:10:32] :D
[12:15:02] kormat: maybe it is time for you to come back to the bash world?
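Going back to the SystemdUnitFailed question above (an alert firing for a unit that no longer exists), here is a minimal sketch of how one might cross-check the alert against both systemd and the metrics that feed it. It assumes the alert is driven by the node exporter's systemd collector on port 9100 (the port shown in the alert text) and uses a placeholder $PROMETHEUS_URL; the actual alert rule is not shown in this log.

# On the affected host: what systemd itself thinks vs. what the exporter reports.
$ systemctl is-failed prometheus-mysqld-exporter.service
$ curl -s localhost:9100/metrics | grep 'node_systemd_unit_state' | grep 'prometheus-mysqld-exporter.service' | grep 'state="failed"'
# Against Prometheus (placeholder URL): is a stale series still marking the unit as failed?
$ curl -sG "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=node_systemd_unit_state{instance="db2198:9100",name="prometheus-mysqld-exporter.service",state="failed"} == 1'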
[12:15:35] * kormat screams &>/dev/null
[13:30:54] my laptop just crashed
[13:30:58] incoming
[13:31:11] get a mac
[13:34:32] it's my Targus dock that makes the system crash :-(
[14:02:05] marostegui: summary of potential schema change gaps: April 4 - April 8 for s2, s6, x1 on db2197; April 3 - April 4 for s7, s8 on db2198
[14:02:40] counting from backup consistency to the time they were added into zarcillo
[14:03:29] But I will CC you when I productionize them, just to keep the info centralized on gerrit
[14:12:07] Thanks
[14:22:02] marostegui: jynus: on my end T360332 was the only schema update of the last week and I've checked s8, s7, s6. s5 and s4 are WIP, so I'll run a --check anyway before going on to the next sections
[14:22:02] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[15:40:41] kormat: o/
[15:41:46] urandom: o/ (if you have time) - I wanted to check https://grafana.wikimedia.org/d/000000418/cassandra for AQS but afaics the metrics are not there
[15:41:51] is it known?
[15:44:59] ah ok, it is only a problem with the cluster selector
[15:45:09] it is not picked up, but if I enter "aqs" manually it works
[15:45:09] elukey: I've noticed this before, but not consistently
[15:45:18] yeah, you have to input the cluster name
[15:45:59] it seems to work sometimes though
[15:46:08] ah wait, I see
[15:46:12] cassandra_table_readlatency{keyspace="system",table="peers"} is not there for AQS
[15:46:21] uh...really?
[15:46:49] the metric I mean
[15:47:16] but there is data with cassandra_table_readlatency{keyspace="system"}
[15:47:39] for the purpose of the dashboard I think it is fine to change it, would it be ok?
[15:48:18] I assume so, yeah
[15:48:28] but I wonder why that's not there
[15:48:28] maybe it loads too many things, checking if one is quicker
[15:48:50] ah no, ok, it seems that "table" is not there
[15:49:00] the table totally is
[15:49:07] yes yes I meant the label
[15:49:07] what about peers_v2?
[15:49:44] urandom: https://w.wiki/9hjT
[15:49:55] but just using the keyspace should work for the dashboard variable
[15:50:30] something is weird with the metrics, none of them have the table label
[15:51:59] yeah.
[15:54:19] something different about the datasource config maybe, because everything else should be identical
[15:54:59] I mean wrt collection and storage
[15:55:17] on aqs1010 I see this from the prometheus exporter
[15:55:18] cassandra_table_readlatency_75p{table="peers",keyspace="system",} 0.0
[15:55:47] that looks sane
[15:56:01] yes yes, but take a look at the metrics pasted above
[15:56:06] it doesn't have the _75p
[15:56:33] even though I see the percentiles for other clusters
[15:57:24] right, what I mean is, it must be something external to the clusters themselves
[15:57:41] something in the collection, processing, and storage
[15:57:56] because the exporters are all configured the same, and the output looks right
[15:58:05] so session store has both cassandra_table_readlatency and cassandra_table_readlatency_XXp
[15:58:15] directly from the exporter on the node I mean
[15:58:46] weird..
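On the missing "table" label: a small sketch of how one could confirm from the Prometheus side which cassandra_table_readlatency series carry the label and which do not. $PROMETHEUS_URL is a placeholder for the relevant Prometheus query endpoint; it relies on the fact that an empty-string matcher in PromQL matches series where the label is absent.

# Series in the system keyspace that have no "table" label at all:
$ curl -sG "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=cassandra_table_readlatency{keyspace="system", table=""}'
# ...and, per table, the series that do carry the label:
$ curl -sG "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=count by (table) (cassandra_table_readlatency{keyspace="system", table!=""})'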
[15:59:15] oh, I think I see what you're saying
[15:59:38] ok sorry, correction, totally my bad
[15:59:49] there is also the non-percentile metric, but for peers_v2, as you asked above
[16:02:02] nope, even cassandra_table_readlatency{table="peers",keyspace="system",}
[16:02:25] at this point it must be the prometheus analytics instance
[16:03:01] yeah, that's what I meant about "datasource" (using the grafana parlance)
[16:15:12] urandom: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017882 - this is probably the reason
[16:38:47] elukey: ooooh
[16:39:23] heh "past Luca"
[16:39:32] "a previous version of Luca..."
[16:49:20] urandom: Luca from the past, better, I'll amend
[16:49:41] or "in the past"
[16:50:07] I'll just say "I"
[16:50:11] more professional
[16:50:36] thanks for the review!
[16:50:49] I'll deploy tomorrow when the DE folks review it
[16:53:01] I liked it before!
[16:53:14] :D
[16:54:01] going afk, new truststore deployed to aqs-codfw, tomorrow I'll do eqiad
[16:54:07] all good so far!
[16:58:47] elukey: thanks!
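Once the patch above is deployed, a quick post-fix check might look like the following; $EXPORTER_PORT and $ANALYTICS_PROMETHEUS_URL are placeholders, since the actual exporter port and analytics Prometheus endpoint are not given in this log.

# On an AQS node: the exporter itself should still emit the per-table series with its label.
$ curl -s "localhost:$EXPORTER_PORT/metrics" | grep '^cassandra_table_readlatency' | grep 'table="peers"'
# On the analytics Prometheus instance: the same series should now be queryable with the table label intact.
$ curl -sG "$ANALYTICS_PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=cassandra_table_readlatency{keyspace="system", table="peers"}'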