[00:36:35] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[00:36:57] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[00:42:51] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[00:46:59] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[01:03:37] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[01:12:31] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[02:41:55] PROBLEM - MariaDB sustained replica lag on s1 on db2146 is CRITICAL: 155.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2146&var-port=9104
[02:46:23] RECOVERY - MariaDB sustained replica lag on s1 on db2146 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2146&var-port=9104
[04:37:15] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[04:47:39] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[10:15:03] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[10:18:01] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[11:50:20] ahoyhoy - just a heads-up: something weird might be going on with restbase2028 - there have been two OOM kills of Cassandra instances there in the last day. It has recovered okay, and I couldn't see anything exceptional in the logs that might have caused the extra load, but it might be worth keeping an eye on
[14:44:49] hnowlan: see https://phabricator.wikimedia.org/T353456
[14:45:32] any ideas there would be appreciated. basically, it's only the nodes that went up recently as part of the refresh (Dell r450s), and, thus far, only nodes in row B
[14:45:57] though I paused the refresh one node short of the intended 3 in row C
[14:47:40] the SAS controller coughs up some resets that correspond, but I don't know whether that is cause or effect.
I found some old(ish) kernel bugs that seemed to suggest those resets happen after periods of intense I/O, so maybe it thrashes a bit before the OOM and those are a red herring(?)
[14:50:26] if it's Cassandra leaking memory, then it's leaking direct memory (otherwise we'd see the JVM itself OOM), and again, it only affects these three machines; everything is running the same JVM, Cassandra, and Debian versions.
[14:57:44] urandom: ohhh interesting, I'll keep having a dig and see if anything shows up. there are some of those SAS controller errors on 2028, but not really around the time of the OOM kill
[14:58:18] ok, I confess that after the first seven, I stopped looking
[14:58:35] (we're up to nine now)
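For readers parsing the alert bodies above: "7.8 ge 2" means the measured sustained lag (7.8 s) is greater than or equal to the critical threshold (2 s), and "(C)2 ge (W)1 ge 0.4" means the value (0.4 s) sits below both the critical (2 s) and warning (1 s) thresholds, i.e. the check is OK. A minimal sketch of that classification logic, using the thresholds visible in the log (the function name and structure are illustrative, not the actual check implementation):

```python
def classify_lag(lag_seconds, warn=1.0, crit=2.0):
    """Classify sustained replica lag against warning/critical
    thresholds, matching how the alert text reads: the value is
    CRITICAL when >= crit, WARNING when >= warn, otherwise OK."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# Values taken from the log above:
print(classify_lag(7.8))    # db1248 at 00:36:35 -> CRITICAL
print(classify_lag(0.4))    # db1221 at 00:42:51 -> OK
print(classify_lag(155.6))  # db2146 at 02:41:55 -> CRITICAL
```

Note that recovery lines report the current value against both thresholds, which is why a recovered host can still show a small nonzero lag (e.g. 0.6 s on db1221 at 01:12:31).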