[00:07:41] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 40.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:15:43] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:35:47] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:36:47] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:34:13] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:31:06] m3 master has been switched over
[05:37:08] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:13] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:33] <_joe_> Amir1: when you wake up, I think it's worth considering the parsercache work and maybe stopping it
[06:11:41] <_joe_> for the fundraising season
[06:11:54] <_joe_> we've had so many instabilities in PC lately :/
[08:10:06] <_joe_> I have taken a look at the overall PC status, I am more than slightly worried, esp. about eqiad
[08:11:08] <_joe_> it seems we're out of spares until pc1013 and pc1017 are repooled/fixed
[08:11:13] <_joe_> am I missing something?
[08:22:45] <_joe_> anyone?
[08:23:12] I'm ready to attach pc1013 to a cluster as a spare, I suggest pc5 but I'm not sure enough about my script to run it without any DBA support in case anything goes bad
[08:24:09] pc1017 is pending an answer on a question to have the same fate
[08:24:18] <_joe_> which question, sorry?
[08:24:22] <_joe_> this has become quite urgent
[08:25:13] on a shared document with Manuel → There are some bootstrapping files missing and I'm lacking the information about the proper way to set them up
[08:25:38] <_joe_> can you point me to the doc? I *might* be able to figure out what you need, I'm sure there's docs
[08:25:40] both could be attached during the day
[08:25:48] ack, thanks _joe_ will do :) let me send you the link
[08:26:27] ah my bad, got my answer this morning
[08:26:32] on it
[08:26:41] (https://mariadb.com/kb/en/mariadb-install-db/ → the answer)
[08:28:02] arnaudb: This all happened in October when I was away, so saying this was blocked on me isn't very fair. Second of all, the script is still not in production and it is waiting to be moved to the normal repo as it was asked a few days ago
[08:29:10] I was not saying it was blocked by you, sorry if it sounded like this → I was indeed waiting for an answer, which I had this morning
[08:29:16] thank you for this
[08:29:38] That totally sounded like it is blocked on me.
[08:30:03] ack
[08:30:48] <_joe_> I assumed marostegui was resting given he's been working at 5 am...
[08:31:01] <_joe_> marostegui: please take at least the afternoon off :)
[08:31:09] I will
[09:14:13] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:56] I'm waking up
[09:40:14] _joe_: it has been stopped since Monday, the only thing happening is the backfill of new keys
[09:40:55] and the backfill happens automatically, if someone visits and it's not parsed with the new key schema yet, it does it and stores it
[09:41:05] no plan to make further changes until Jan
[09:57:59] <_joe_> Amir1: ok, that's reassuring
[09:58:04] <_joe_> or maybe not
[10:00:04] reading the chat for the incident, loss of a pc node should not cause a major outage. It should fall back to the second one in line, that needs fixing now
[10:00:07] <_joe_> maybe not because then this is organic traffic
[10:00:10] <_joe_> yes
[10:00:15] <_joe_> I added that to the doc already
[10:01:00] I can give you details of my plan later
[10:21:05] <_joe_> I
[10:21:14] <_joe_> I'm looking at pc2015
[10:21:22] <_joe_> and there are clear signs of contention being the issue
[10:21:28] <_joe_> but there are some peculiar things
[10:21:43] <_joe_> 1) an increase of SELECT RANGE queries starting at 1 AM
[10:21:56] <_joe_> corresponding to the increase in read rows I noticed
[10:22:54] <_joe_> 2) the outage was caused by contention on an innodb row, innodb_row_lock_waits spiked before the db became so unresponsive no data was gathered
[10:25:06] <_joe_> but also, what does this mean? https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=mysql-parsercache&var-server=pc2015&var-port=9104&from=now-7d&to=now&viewPanel=20 Amir1 / arnaudb any idea? else I'll search the docs and look at the code
[10:25:45] <_joe_> there was a huge spike of innodb_os_log_pending_writes two days ago, I have no idea what that actually means
[10:26:19] <_joe_> ah, it comes straight from mysql
[10:27:46] _joe_: the spike from two days ago (Monday?) is because of the deploy of the new key schema
[10:28:13] that's three days ago though
[10:28:15] <_joe_> no I mean the number of writes pending to the redo logs never recovered
[10:28:38] ah, then I don't know
[10:28:49] I'll investigate a bit more
[10:28:59] <_joe_> and no, see the graph, the spike in that counter was yesterday at 4 pm
[10:29:32] no changes happened yesterday
[10:30:17] <_joe_> yeah I don't think it's related
[10:30:26] <_joe_> to your changes
[10:31:53] I'm not seeing anything similar in pc2014 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=mysql-parsercache&var-server=pc2014&var-port=9104&from=now-7d&to=now
[10:32:23] unless randomization in mediawiki is utterly broken, the only thing it can be is bad config or hw issues in pc?
[10:32:33] *pc2015
[10:38:38] <_joe_> it seems like the same counter value is on pc2011 and 2013
[10:38:46] <_joe_> so I guess it's kind of a broken metric
[13:17:08] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:28] this is test-s4, it probably shouldn't alert at all
[17:12:08] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:51] I downtimed it, but it shouldn't alert.
[21:12:08] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed