[08:11:29] <_joe_> see #_security, we seemingly had trouble on ES again tonight. I'm not very happy it's the 4th time there's some trouble on ES in the last month.
[11:35:28] PROBLEM - MariaDB sustained replica lag on s1 on db2212 is CRITICAL: 47.25 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2212&var-port=9104
[11:36:00] it's ok
[11:36:28] RECOVERY - MariaDB sustained replica lag on s1 on db2212 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2212&var-port=9104
[12:28:43] PROBLEM - MariaDB sustained replica lag on s3 on db2205 is CRITICAL: 31.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[12:29:43] RECOVERY - MariaDB sustained replica lag on s3 on db2205 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[13:15:22] Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130592 please? adding ms-be2089 to the rings
[13:25:45] Emperor: ✅
[13:26:16] thanks :)
[14:00:16] PROBLEM - MariaDB sustained replica lag on x1 on db2196 is CRITICAL: 100.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
[14:05:16] RECOVERY - MariaDB sustained replica lag on x1 on db2196 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
[14:09:54] to make cookbooks more consistent do we prefer `--task` or `--task-id` ? Also hostnames or fqdns or both?
[14:11:19] the next release (already in master) of spicerack will allow to automatically add -t/--task-id and -r/--reason
[14:11:59] so I'd suggest to stick with --task-id for an easier transition
[14:12:18] I am fine either way: honestly, "-t" is just faster
[14:12:19] I used that one as the one used in most if not all cookbooks that have a task argument
[15:50:21] Amir1: ms2 is configured, can I deploy the dbctl config too?
[15:50:30] sounds good
[15:50:33] I can monitor
[15:50:40] Amir1: But I can then?
[15:51:01] the moment you commit, it'll start getting traffic and it should be fine
[15:51:13] ms3 has traffic?
[15:51:15] ms3 is already getting traffic
[15:51:18] yup
[15:51:20] ok!
[15:51:23] I did it to warm it up
[15:55:56] Amir1: I am going to pool ms2
[15:56:07] let's go
[15:56:19] pushed
[15:56:30] I see connections
[15:58:02] Amir1: How does it look from your end?
[15:58:03] I see connections being removed from other sections too
[15:58:16] (it should go down to two thirds)
[15:58:54] marostegui: a lot of errors
[15:58:58] nice
[15:58:58] for servers being RO
[15:59:00] which ones?
[15:59:02] ah right
[15:59:03] fixing
[15:59:09] https://logstash.wikimedia.org/goto/7c7e59eadf52d9117802b15a5b58a9f5
[15:59:17] fixed
[15:59:22] It is the default when we restart hosts
[15:59:34] They should be gone
[16:00:00] 131 errors, it's practically zero in error budget
[16:00:14] errors gone to zero now
[16:00:57] let's not depool ms1 for now
[16:01:00] Let's give it a day
[16:01:06] sure
[16:01:17] I'll update the docs and such after my meeting
[16:01:36] ok, I am going to review all zarcillo entries to make sure it is all good
[17:18:31] I got hoisted by my own petard
[17:18:46] https://usercontent.irccloud-cdn.com/file/rMYiLYTP/image.png
[17:18:48] for a thumbnail
[18:36:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2204:9104 has too large replication lag (12h 9m 43s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2204&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:43:18] ^ this was an issue with pt-heartbeat which I just solved
[20:46:48] RESOLVED: MysqlReplicationLagPtHeartbeat: MySQL instance db2204:9104 has too large replication lag (14h 15m 53s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2204&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
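
For reference, a minimal sketch of the `-t/--task-id` and `-r/--reason` convention discussed in the 14:09–14:12 exchange above. It only illustrates the argparse shape a cookbook might use; it is not the actual spicerack or cookbook API, and the `host` positional argument is a hypothetical stand-in for the hostname-vs-FQDN question.

```python
# Illustrative only: an argument parser following the -t/--task-id and
# -r/--reason convention discussed above.  The real spicerack helper may
# register these arguments differently.
import argparse


def argument_parser() -> argparse.ArgumentParser:
    """Build an example parser with the shared task/reason arguments."""
    parser = argparse.ArgumentParser(description="example cookbook")
    parser.add_argument(
        "-t", "--task-id",
        help="Phabricator task ID to reference in logs (e.g. T123456)",
    )
    parser.add_argument(
        "-r", "--reason",
        help="short administrative reason for the operation",
    )
    # Whether to accept short hostnames, FQDNs, or both was the other open
    # question; a single free-form positional leaves that to the caller here.
    parser.add_argument("host", help="target host (hostname or FQDN)")
    return parser


if __name__ == "__main__":
    print(argument_parser().parse_args(["-t", "T123456", "db2212.codfw.wmnet"]))
```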
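
The 18:36–20:46 alerts are easier to read with the pt-heartbeat measurement model in mind. A minimal sketch follows, assuming the usual setup in which pt-heartbeat writes an ISO-8601 UTC timestamp into `heartbeat.heartbeat` on the primary roughly once a second and the exporter reports "now minus the newest replicated timestamp". Under that model, if the writer on the primary stalls, the reported lag grows without bound even when replication itself is healthy, which is consistent with a 12h+ reading being resolved on the pt-heartbeat side rather than by catching up replication. The function and timestamp below are illustrative, not the actual exporter code.

```python
# Hypothetical illustration of pt-heartbeat-style lag: lag is the wall-clock
# time elapsed since the newest heartbeat row that has replicated to this host.
from datetime import datetime, timezone


def heartbeat_lag_seconds(latest_heartbeat_ts: str) -> float:
    """Compute lag from the newest replicated heartbeat timestamp.

    `latest_heartbeat_ts` is assumed to be the ISO-8601 UTC string stored in
    heartbeat.heartbeat.ts, e.g. '2025-03-27T08:36:48.000123'.
    """
    ts = datetime.fromisoformat(latest_heartbeat_ts).replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - ts).total_seconds()


# Example: a heartbeat last written ~12 hours ago yields ~43200 seconds of
# reported "lag", matching the alert above even if the replica is current.
```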