[07:27:02] (SystemdUnitFailed) firing: (17) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:52:02] (SystemdUnitFailed) firing: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:50] arnaudb: wasn't db2215 already cloned?
[08:22:34] I was not sure of its state given the last issue, so I figured it was better to retry the cookbook to see whether the hiccup came from that or was just a blip
[08:22:55] The hiccup had nothing to do with those hosts, but with the intermediate master
[08:23:03] oh
[08:23:03] Maybe I wasn't clear on the task comment
[08:23:53] I understood that it came from semi sync but was not sure if it was from the master or a missed event from the replica
[08:24:26] yeah no, no data was lost or anything
[08:24:27] anyway, it'll be pooling by eod!
[08:24:34] yeah no problem, all good
[08:24:36] better be safe
[08:47:02] (SystemdUnitFailed) resolved: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:56:47] Soo... I got a notification for aqs2001 this morning, but even according to icinga, it has been down for more than 4 days
[14:58:14] actually... it seemed to show that the cql ports had been down for 4 days, the rest of it was "unknown"
[14:58:29] now it shows down (as of ~3 hours ago)
[14:58:49] nothing shows on alerts.w.o though, even still
[15:00:28] but, it's kind of disconcerting to have something down for so long without knowing about it
[15:01:13] I even made a point to go through icinga.w.o/alerts and alerts.w.o yesterday
[17:03:15] urandom: Hey, I picked up the work for cassandra+PCS on staging
[17:03:34] Are these already created on cassandra dev? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mobileapps/+/refs/heads/master/scripts/schema/cassandra_schema.cql
[17:29:10] nemo-yiannis: yeah, it's ready to go
[19:06:28] thanks! i will test things tomorrow on staging
[23:16:09] PROBLEM - MariaDB sustained replica lag on s4 on db1243 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
[23:17:09] RECOVERY - MariaDB sustained replica lag on s4 on db1243 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104
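
Context on the semi-sync discussion above (08:22-08:24): a minimal sketch of how one might verify semi-sync replication state on the intermediate master and its replica by hand. This assumes shell access to the hosts and the standard MariaDB semi-sync status variables; it is an illustrative check, not the cookbook's own verification.

    # On the (intermediate) master: is semi-sync active, and how many replicas are acking?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%'"

    # On the replica: is the semi-sync slave side enabled?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_slave_status'"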
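
On the cassandra_schema.cql question (17:03): a minimal sketch of applying that schema file to the dev cluster with cqlsh. The hostname and credentials below are placeholders, not the actual cassandra-dev endpoints.

    # Fetch the schema file locally first, then apply it; host/user/password are placeholders
    cqlsh cassandra-dev-host -u cassandra -p 'REDACTED' -f cassandra_schema.cql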
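
On the db1243 replica lag alert (23:16): a quick manual spot-check of lag on the replica, assuming shell access. Note this is only a sketch; the production "sustained replica lag" check may be driven by heartbeat data rather than Seconds_Behind_Master.

    # Seconds_Behind_Master gives a rough view of current lag on the replica
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'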