[01:19:32] (MysqlReplicationLag) firing: MySQL instance db2139:13313 has too large replication lag (6m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[03:49:32] (MysqlReplicationLag) resolved: MySQL instance db2139:13313 has too large replication lag (14m 17s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:58:32] marostegui: morning, T312863 is only left on the codfw master of s1, I assume we need to do a switchover for the reboot as well. Is it fine if I just do it? No user impact given codfw is read-only anyway
[06:58:32] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:43:03] Amir1: Sure, I was planning to take care of that today or tomorrow, but up to you :)
[08:43:40] sure, let me see what I can do.
[08:44:09] I can take care of that
[08:44:15] You have plenty of stuff :)
[08:47:10] db1132 seems to be short of space: DISK WARNING - free space: / 3216 MB (9% inode=97%): /tmp 3216 MB (9% inode=97%): /var/tmp 3216 MB (9% inode=97%):
[08:47:40] uh... interesting
[08:47:42] I will take care of that
[08:47:45] Thanks for the heads up
[08:48:02] sorry for pinging on a warning, but better to do it early than late
[08:48:05] fixed :)
[08:48:13] it was the core dumps generated for the debugging
[08:50:34] there are a few other disk alerts that could be relevant to this team: ms-be2035, aqs1004
[08:52:12] * Emperor just put in a CR to remove ms-be20[28-39] from the rings :)
[08:55:49] marostegui: Thanks <3
[08:56:45] ms-be2035 is presumably T314509, which is being "resolved" by shuffling it off its mortal coil
[08:56:45] T314509: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509
[08:56:46] while debugging what in the end turned out to be a train issue, I ran into a long-running source of mediawiki replication errors, FYI: https://logstash.wikimedia.org/goto/a822d07342bb4833ada55f3501ae1f67
[09:23:17] but if you see no errors, that looks like it is working as expected
[09:27:12] sorry, ignore that last line, it was for another channel
[09:28:11] I will file a bug for ChronologyProtector unless you see a trivial error at the DB layer
[10:02:47] URGENT BUG: the percentage sign in the backup monitoring alert needs to be prefixed by a non-breakable space to avoid "-7.0\n%"
[10:22:57] I filed https://phabricator.wikimedia.org/T317625
[10:50:28] marostegui: okay if I run the schema change on the old s1 codfw master?
[10:51:10] Amir1: not yet
[10:54:42] let me know
[11:07:31] Amir1: db2103 is fully yours (it is pooled)
[11:07:39] Thanks!
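For context on the backup-monitoring formatting bug filed above as T317625: the point of the fix is that joining the value and the percent sign with a non-breaking space (U+00A0) keeps them on one line when the alert text is wrapped, avoiding output like "-7.0\n%". A minimal sketch of that kind of formatting, using a hypothetical format_pct helper rather than the actual alerting code:

    # Hypothetical helper illustrating the fix proposed in T317625: join the
    # numeric value and the percent sign with a non-breaking space (U+00A0)
    # so line wrapping cannot split "-7.0" from "%".
    NBSP = "\u00a0"

    def format_pct(value: float) -> str:
        return f"{value:.1f}{NBSP}%"

    print(format_pct(-7.0))  # "-7.0 %", with a non-breaking space before the sign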
[11:07:53] We have again a bunch of hosts depooled, are they meant to be depooled?
[11:08:11] marostegui: we have a lot of schema changes going on
[11:08:19] ok, as long as they are under control, that's ok
[11:08:29] noting that I now run schema changes in parallel within one section too (one per dc)
[11:08:44] yeah, but we have 2 from s7 in codfw for instance
[11:09:12] Ah no, they are from eqiad and codfw
[11:09:21] yeah, that's expected
[11:17:50] I love how the templatelinks drop in s4 is like a sharp hammer dropping 210GB: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db1147&var-datasource=thanos&var-cluster=wmcs&from=now-24h&to=now
[11:17:50] vs in s3 it's a slow decrease because it's just lots of small wikis: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db2109&var-datasource=thanos&var-cluster=wmcs&from=now-12h&to=now
[11:17:50] still, 100GB in s3 is gone too
[12:27:11] I am going to reboot db2093, which is zarcillo's slave
[12:27:18] I don't think it is used for anything, but just FYI
[12:27:37] It used to be the orchestrator master, but that's no longer the case
[14:45:55] so to recap: I was matching the last alert with the fact that some misc read-only alerts were paging, but maybe they shouldn't; I saw it some weeks ago but forgot to ask
[21:05:32] (MysqlReplicationLag) firing: (2) MySQL instance db1139:13311 has too large replication lag (11m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:10:32] (MysqlReplicationLag) resolved: (2) MySQL instance db1139:13311 has too large replication lag (11m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:20:32] (MysqlReplicationLag) firing: (3) MySQL instance db1139:13311 has too large replication lag (21m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:25:32] (MysqlReplicationLag) resolved: (2) MySQL instance db1150:13314 has too large replication lag (9m 50s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
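The MysqlReplicationLag alerts above fire when a replica falls too far behind its source, and the linked runbook covers depooling the affected instance. A minimal sketch of how lag could be checked directly on one instance, assuming pymysql and illustrative host, port, credential and threshold values rather than the actual production monitoring:

    # Illustrative replication-lag check for a single MariaDB/MySQL replica.
    # Host, port, credentials and the 300-second threshold are assumptions
    # for the example, not the values used by the real alerting.
    import pymysql

    def replication_lag_seconds(host, port, user, password):
        conn = pymysql.connect(host=host, port=port, user=user,
                               password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
                # None: replication is not configured on this instance.
                # Seconds_Behind_Master is NULL while the SQL thread is stopped.
                return status["Seconds_Behind_Master"] if status else None
        finally:
            conn.close()

    lag = replication_lag_seconds("db2139.codfw.wmnet", 13313, "monitor", "secret")
    if lag is None or lag > 300:
        print(f"replica lagging or broken (lag={lag}); consider depooling per the runbook")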