[07:40:01] Amir1: did the wikishared alter finish?
[07:40:06] the one on the massive table
[07:40:30] marostegui: replicas of each DC, yes
[07:40:37] cool
[07:40:53] Do we have any more schema changes pending, or can I start scheduling a dc switch?
[07:43:05] I can't find anything
[07:43:18] we have another one that's also done on replicas of each dc
[07:44:35] ok, so I can go ahead and start planning a dc switch
[07:45:41] marostegui: for next week I'm hoping to get s1 and s3 out of the door (templatelinks), I can drop one of them
[07:45:48] cool!
[09:24:29] Can I get a review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/827458
[09:24:33] Mostly IPs and ports
[09:40:18] thanks jynus
[09:42:41] jynus: ok to stop db1117:3323? I don't see any backup running there
[09:42:55] it should be ok
[09:42:59] ok!
[09:43:02] ta
[10:30:40] I am going to reboot zarcillo
[10:35:54] done, orchestrator is back too
[12:31:36] orchestrator seems broken
[12:31:49] I guess it didn't like the db restart
[12:32:13] ok, fixed now
[12:39:16] Amir1: is your script also rebooting x1 hosts?
[12:39:58] marostegui: haven't tried it, but it should be able to
[12:40:24] ok, just saying because right now we are running STATEMENT there, so we need to make sure they come back as STATEMENT for now
[12:40:26] I will change puppet
[12:40:53] Ah. Yeah. Sgtm
[12:42:46] marostegui: for the sake of record keeping: it's rebooting s2 eqiad now. s1 will follow
[14:02:29] hey, amir, I have a question about https://gerrit.wikimedia.org/r/c/operations/alerts/+/825294/ if you have the time
[14:03:40] jynus: I'm quickly grabbing lunch, I'll ping you soon
[14:04:20] yeah, no hurry, we can talk later
[14:04:30] (or tomorrow)
[15:05:10] jynus: my meeting is now over, how can I help?!
[15:06:46] I am wondering if https://gerrit.wikimedia.org/r/c/operations/alerts/+/825294/ is just a proof of concept or if it is meant to substitute the replication alerting?
[15:07:17] jynus: it's the start of replacing the icinga alert
[15:07:46] so my worry is that I see lots of regressions with the current system
[15:08:19] I started it as a warning, but I will make it critical once we are sure it's working properly, and adjust it as well (e.g. icinga has two alerts for different periods)
[15:08:27] (not worried about prometheus, but about the prometheus exporter logic)
[15:08:56] jynus: tell me more
[15:09:15] I can give you an example (but not the only one): if I stop replication, the prometheus alert will say nothing, while the current alert will track the lag
[15:09:50] I see, yeah, sure, we need to make sure all of this is taken into account
[15:10:00] it shouldn't be too hardTM to adapt them
[15:10:35] the solution can also be different, e.g. for your case we can add an alert if there is no replication
[15:10:35] yes, what I mean is, the reason why I didn't touch the perl script is that there is very subtle behaviour incorporated over the years
[15:10:49] that the prometheus exporter is unable to handle at all
[15:11:05] not just a question of tuning the query
[15:11:34] the prometheus exporter needs to be hacked, or a separate exporter has to be coded
[15:12:23] yeah, but we eventually need to pull the plug on icinga, so at least for the easy cases we can make sure it's covered, and then probably write an exporter or something like that
[15:12:55] another issue is that the prometheus exporter fails frequently (less so after the patch from Emper*r)
[15:13:02] that would need to page or alert in some way
[15:13:17] plus paging depends on the mysql role, so that logic should also be built in
[15:14:43] hmm, yup. "fun". If you detail all of these in the ticket, I'd really appreciate it.
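[Editor's note: the gap jynus describes above — a stopped replica reports no lag, so a lag-only alert stays silent — would typically be covered by a separate Prometheus rule on the replication-thread state. A minimal sketch, assuming the stock mysqld_exporter `slave_status` metrics; the alert name, duration, and severity here are illustrative, not the deployed WMF configuration:]

```yaml
groups:
  - name: mariadb_replication_state
    rules:
      # A lag-based alert has nothing to exceed its threshold when
      # replication is stopped, so the stopped state needs its own
      # expression on the thread-status metrics.
      - alert: MariaDBReplicationStopped
        expr: >
          mysql_slave_status_slave_sql_running == 0
          or mysql_slave_status_slave_io_running == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication stopped on {{ $labels.instance }}"
```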
[15:14:47] sure
[15:15:05] my point is, I have absolutely nothing against moving to prometheus
[15:15:51] it is only going to be quite difficult, and you may not have been aware of those subtleties
[15:16:21] no blocker for merging that, just for removing the icinga check
[15:16:58] as sadly, the prometheus stack is not as flexible for custom metrics as icinga
[15:17:33] this would be the main blocker: https://phabricator.wikimedia.org/T141968
[15:17:39] but I will comment the others on the ticket
[15:23:08] do you want me to change the dependencies, or should I just suggest them by text and you decide whether to make it depend on those tickets or not?
[15:24:48] nah, just a mention for now is enough; there are many parts of alerting that need to be moved, and this is a blocker for the replication part, not all of it (others might have different blockers)
[15:24:50] I will do the second for now
[15:24:54] ok
[22:15:06] PROBLEM - MariaDB sustained replica lag on s3 on db1175 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1175&var-port=9104
[22:19:30] PROBLEM - MariaDB sustained replica lag on s3 on db1179 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[22:19:56] RECOVERY - MariaDB sustained replica lag on s3 on db1175 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1175&var-port=9104
[22:23:46] PROBLEM - MariaDB sustained replica lag on s3 on db1166 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[22:25:46] PROBLEM - MariaDB sustained replica lag on s3 on db1112 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1112&var-port=9104
[22:30:40] PROBLEM - MariaDB sustained replica lag on s3 on db1112 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1112&var-port=9104
[22:31:04] PROBLEM - MariaDB sustained replica lag on s3 on db1166 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[22:33:28] RECOVERY - MariaDB sustained replica lag on s3 on db1166 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[22:35:30] RECOVERY - MariaDB sustained replica lag on s3 on db1112 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1112&var-port=9104
[22:36:32] PROBLEM - MariaDB sustained replica lag on s3 on db1179 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[22:36:58] PROBLEM - MariaDB sustained replica lag on s3 on db1175 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1175&var-port=9104
[22:39:22] RECOVERY - MariaDB sustained replica lag on s3 on db1175 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1175&var-port=9104
[22:43:48] PROBLEM - MariaDB sustained replica lag on s3 on db1179 is CRITICAL: 4.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[22:53:28] RECOVERY - MariaDB sustained replica lag on s3 on db1179 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[23:07:48] PROBLEM - MariaDB sustained replica lag on s3 on db1179 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[23:18:52] PROBLEM - MariaDB sustained replica lag on s3 on db1112 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1112&var-port=9104
[23:21:42] PROBLEM - MariaDB sustained replica lag on s3 on db1166 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[23:22:20] RECOVERY - MariaDB sustained replica lag on s3 on db1179 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1179&var-port=9104
[23:23:44] RECOVERY - MariaDB sustained replica lag on s3 on db1112 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1112&var-port=9104
[23:29:00] PROBLEM - MariaDB sustained replica lag on s3 on db1166 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[23:30:42] PROBLEM - MariaDB sustained replica lag on s3 on db1123 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1123&var-port=9104
[23:31:24] RECOVERY - MariaDB sustained replica lag on s3 on db1166 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1166&var-port=9104
[23:34:26] PROBLEM - MariaDB sustained replica lag on s3 on db1189 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1189&var-port=9104
[23:36:52] RECOVERY - MariaDB sustained replica lag on s3 on db1189 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1189&var-port=9104
[23:40:24] RECOVERY - MariaDB sustained replica lag on s3 on db1123 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1123&var-port=9104
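[Editor's note: the "sustained replica lag" notifications above fire at (W)1 / (C)2 seconds. A rough sketch of what such paired rules look like as Prometheus alerting rules, assuming the stock mysqld_exporter metric name; alert names and `for:` durations are illustrative, not the deployed configuration:]

```yaml
groups:
  - name: mariadb_replication_lag
    rules:
      # Warning at 1s, critical at 2s, matching the (W)1 ge / (C)2 ge
      # thresholds shown in the alert output above.
      - alert: MariaDBSustainedReplicaLagWarning
        expr: mysql_slave_status_seconds_behind_master >= 1
        for: 10m
        labels:
          severity: warning
      - alert: MariaDBSustainedReplicaLagCritical
        expr: mysql_slave_status_seconds_behind_master >= 2
        for: 10m
        labels:
          severity: critical
```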