[07:27:02] (SystemdUnitFailed) firing: (17) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:52:02] (SystemdUnitFailed) firing: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:50] arnaudb: wasn't db2215 already cloned?
[08:22:34] I was not sure of its state given the last issue, so I figured it was better to retry the cookbook to see whether the hiccup came from that or was just a blip
[08:22:55] The hiccup had nothing to do with those hosts, but with the intermediate master
[08:23:03] oh
[08:23:03] Maybe I wasn't clear on the task comment
[08:23:53] I understood that it came from semi sync but was not sure if it was from the master or a missed event from the replica
[08:24:26] yeah no, no data was lost or anything
[08:24:27] anyway, it'll be pooling by eod!
[08:24:34] yeah no problem, all good
[08:24:36] better be safe
[08:47:02] (SystemdUnitFailed) resolved: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:56:47] Soo... I got a notification for aqs2001 this morning, but even according to icinga, it has been down for more than 4 days
[14:58:14] actually... it seemed to show that the cql ports had been down for 4 days, the rest of it was "unknown"
[14:58:29] now it shows down (as of ~3 hours ago)
[14:58:49] nothing shows on alerts.w.o though, even still
[15:00:28] but, it's kind of disconcerting to have something down for so long without knowing about it
[15:01:13] I even made a point to go through icinga.w.o/alerts and alerts.w.o yesterday
[17:03:15] urandom: Hey, I picked up the work for cassandra+PCS on staging
[17:03:34] Are these already created on cassandra dev? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mobileapps/+/refs/heads/master/scripts/schema/cassandra_schema.cql
[17:29:10] nemo-yiannis: yeah, it's ready to go
[19:06:28] thanks! i will test things tomorrow on staging
[23:16:09] PROBLEM - MariaDB sustained replica lag on s4 on db1243 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[23:17:09] RECOVERY - MariaDB sustained replica lag on s4 on db1243 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
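
Context on the semi-sync discussion above (08:22-08:24): a minimal sketch of how one might verify semi-sync replication state on the intermediate master and its replica by hand. This assumes shell access to the hosts and the standard MariaDB semi-sync status variables; it is an illustrative check, not the cookbook's own verification.

    # On the (intermediate) master: is semi-sync active, and how many replicas are acking?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'"

    # On the replica: is the semi-sync slave side enabled?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_slave_status'"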
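
On the cassandra_schema.cql question (17:03): a minimal sketch of applying that schema file to the dev cluster with cqlsh. The hostname and credentials below are placeholders, not the actual cassandra-dev endpoints.

    # Fetch the schema file locally first, then apply it; host/user/password are placeholders
    cqlsh cassandra-dev-host -u cassandra -p 'REDACTED' -f cassandra_schema.cql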
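
On the db1243 replica lag alert (23:16): a quick manual spot-check of lag on the replica, assuming shell access. Note this is only a sketch; the production "sustained replica lag" check may be driven by heartbeat data rather than Seconds_Behind_Master.

    # Seconds_Behind_Master gives a rough view of current lag on the replica
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'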