[06:54:46] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s6.service Failed on db2194:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:50] (SystemdUnitFailed) firing: (3) prometheus-mysqld-exporter.service Failed on db2194:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:02] is someone working on db2194?
[08:16:41] yes, will mute it
[08:16:45] sorry about the noise :)
[08:19:47] I'm not sure we really need every 4h email & IRC alerts for a systemd unit failed...
[08:20:12] (but that's a wider discussion)
[09:58:41] s5 (codfw): Size change (bytes) -55.2 GB (-7.8 %)
[10:12:17] niice
[10:12:32] I start dropping the columns from s4 today, that'd be fun
[14:57:47] the last feature of the new iteration of the mariadb clone cookbook has been tested after a few patchsets, given the patch size: thanks to anybody contributing to the review https://w.wiki/99VS
[15:05:09] my timing sucks, but if anyone has the cycles to sanity-check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/998538, it would be appreciated!
[15:09:35] 👀
[17:07:48] apropos the meeting discussion, would a phab task for systemdunitfailed be better than an email?
[17:15:18] personally, I'd prefer to have non-useful IRC logs than more phab tickets being left open forever
[17:25:09] OK, I've opened T357333 to ask our friends on observability for advice
[17:25:21] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[17:25:41] thank you
[17:27:48] (PuppetZeroResources) firing: Puppet has failed generate resources on db1140:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:42:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:52:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:53:42] arnaudb: probably needs downtime? ^
[17:54:10] jynus: sorry to bother, a sanity check for something I want to do but Manuel is out.
[17:54:35] pagelinks in enwikinews is partitioned, to remove partitions, doing "ALTER TABLE tbl REMOVE PARTITIONING;" should be fine?
[17:54:48] https://www.irccloud.com/pastebin/P8RLV3mh/
[17:56:02] (ofc without replication)
[17:57:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:07:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1133:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:42:44] arnaudb: I've downtimed db1133 for 13 hours (so to about 08:30 UTC tomorrow)
[18:44:13] (looks like there's >1 host affected, so should I do more?)
[18:45:23] yeah, let's downtime them
[20:36:22] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[20:46:25] sad_trombone.wav
[22:08:03] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources