[00:30:10] PROBLEM - MariaDB sustained replica lag on s8 on db2167 is CRITICAL: 43.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[00:32:12] RECOVERY - MariaDB sustained replica lag on s8 on db2167 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[01:17:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:47:42] Is there a task for the repooling cookbook bug?
[06:34:16] I just made es2022 master of es4
[08:59:41] marostegui: which bug?
[08:59:54] volans: I just added you to an email thread
[09:00:44] ahh that bug, ok
[09:01:28] the solution for that is to add to dbctl a way to get the diff as a dict, instead of reverse-engineering the text diff output in the cookbook
[09:02:02] volans: Is this https://phabricator.wikimedia.org/T380194 ?
[09:02:04] and all this because JSON doesn't allow a comma after the last item in a list/object
[09:02:12] yes
[09:02:22] Good, I saw it earlier and I was happy to see that task
[09:03:32] now for a short-term fix we could just ask the operator what to do, but we were reluctant as pool/repool are probably also meant to be used as part of other automation
[09:03:57] yeah, exactly
[09:05:02] I'm not sure if there can be cases of depool that show that issue too
[09:05:44] given we're on the topic... I'd like to release spicerack 9.0.0 today, which has the unification of the mysql modules, and I'd need a ~30-minute "freeze" for cookbooks related to the mysql or mysql_legacy module.
[09:05:59] Is today a good day? What would be a good time? cc Amir1 arnaudb
[09:06:02] volans: I am fine with that, let's see if the others are
[09:07:19] 100% ok on my end :)
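(Editor's note: a minimal sketch of the "diff dict" idea discussed above at 09:01, assuming a dbctl-like configuration represented as plain Python dicts. The function name generate_diff_dict and the config shapes are hypothetical illustrations, not the real dbctl/conftool API.)

    # Hypothetical sketch: give the cookbook a machine-readable diff between two
    # dbctl-like configuration dicts, instead of reverse-engineering a text diff.
    def generate_diff_dict(old: dict, new: dict) -> dict:
        """Return added/removed/changed keys between two flat config dicts."""
        added = {k: new[k] for k in new.keys() - old.keys()}
        removed = {k: old[k] for k in old.keys() - new.keys()}
        changed = {
            k: {"old": old[k], "new": new[k]}
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        }
        return {"added": added, "removed": removed, "changed": changed}

    # Example: a repool of db2167 in s8 shows up as a structured change,
    # with no JSON text parsing and no trailing-comma pitfalls.
    old_cfg = {"db2167": {"weight": 0, "pooled": False}}
    new_cfg = {"db2167": {"weight": 100, "pooled": True}}
    print(generate_diff_dict(old_cfg, new_cfg)["changed"])

Returning structured data like this would let the pool/repool cookbooks act on the diff directly, which also keeps them usable from other automation, as noted at 09:03.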
[09:17:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:49] ^ anyone checked this host?
[09:36:39] not yet, will get to it after paging stuff
[09:38:26] sounds good thanks, if not I will try to get to it later when I am done with external storage too
[09:39:07] ack, first one to get to it notifies the other :)
[09:41:51] sounds good!
[10:00:13] volans: go for it
[10:01:27] marostegui: this has been paging so I looked at it a bit yesterday. Replication is working fine, so it's not mysql itself; the prometheus exporter is not happy
[10:04:42] Amir1: what do you mean paging?
[10:05:01] sorry, not paging, need coffee, alerting
[10:05:28] yeah, I saw it yesterday too, but I didn't get to it
[10:05:38] Amir1: did you get to see what the issue was with the exporter?
[10:05:53] no
[10:06:04] just made sure prod db is okay
[10:06:04] Amir1: Ok I will check later if arnaudb doesn't do it before me
[10:06:10] thank you!
[10:07:33] errr
[10:07:40] 34d 1h 25m 9s 3/3 PROCS CRITICAL: 0 processes with command name 'mysqld'
[10:07:46] 34 days with no mysql?
[10:08:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084145
[10:08:07] is this host in production or not?
[10:08:55] Ok, so according to https://phabricator.wikimedia.org/T373579 the changes are merged, but Jaime will get to it when he's back, which will happen in January
[10:09:52] Can we just revert and leave it on insetup? Otherwise we have a host showing up on icinga but not really in production at all
[10:10:17] It is also on instances.yaml, and backup sources would never get dbctl, so we need to remove it from there
[10:47:57] marostegui, Amir1, arnaudb: ok then, proceeding, please hold off mysql-related cookbooks for a few minutes
[10:48:08] volans: thanks
[10:49:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100068 marostegui here is the fix
[11:07:28] Followed up on the patch
[11:08:15] marostegui, Amir1, arnaudb: it should be all done, from my quick dry-run tests it seems all good, you can resume normal runs. But please let me know if you encounter any strange behaviour
[11:08:49] thanks volans!
[11:08:55] if you have some runs that you or I can use as a test, even better (like depool/pool, clone, upgrade)
[11:23:17] Thanks!
[11:47:08] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:13] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:28] ^ I have also removed it from zarcillo (as that is used by the prometheus exporter)
[13:05:53] testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1063167 on codfw nodes to restart sanitarium instances (not the hosts) - in case it breaks something, please shout!
[15:52:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:00:48] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 69 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:06:48] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:50:49] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 153.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:57:48] FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (8m 28s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[19:59:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (9m 21s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:54:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (1m 42s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:57:48] RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (1m 42s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[20:59:51] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[21:04:51] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 30.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[21:07:51] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[22:33:51] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 12.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[22:35:51] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[23:44:20] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 130 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[23:52:22] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
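(Editor's note: a minimal sketch of how the pt-heartbeat-style lag behind the MysqlReplicationLagPtHeartbeat alerts above can be measured: compare the freshest heartbeat timestamp replicated from the primary with the replica's current time. The heartbeat.heartbeat table layout, host name, credentials, and timestamp format are assumptions for illustration, not the production alert code; the thresholds mirror the (W)5 / (C)10 values in the checks above.)

    # Hypothetical sketch: pt-heartbeat-style replica lag check (illustrative only).
    import pymysql
    from datetime import datetime, timezone

    WARN, CRIT = 5.0, 10.0  # seconds, mirroring the (W)5 / (C)10 thresholds above

    # Placeholder connection details; db1206 is the replica flapping in the log above.
    conn = pymysql.connect(host="db1206.example", user="monitor", password="...", database="heartbeat")
    with conn.cursor() as cur:
        # Assumes pt-heartbeat stores ts as an ISO-8601 UTC string, e.g. "2024-12-05T19:57:48.000123"
        cur.execute("SELECT ts FROM heartbeat ORDER BY ts DESC LIMIT 1")
        (ts,) = cur.fetchone()

    last_beat = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    lag = (datetime.now(timezone.utc) - last_beat).total_seconds()

    status = "CRITICAL" if lag >= CRIT else "WARNING" if lag >= WARN else "OK"
    print(f"replica lag {lag:.1f}s -> {status}")

Because the heartbeat row is written on the primary and read back on the replica, this measures end-to-end replication delay rather than relying on Seconds_Behind_Master from SHOW SLAVE STATUS.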