[01:19:32] <jinxer-wm>	 (MysqlReplicationLag) firing: MySQL instance db2139:13313 has too large replication lag (6m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[03:49:32] <jinxer-wm>	 (MysqlReplicationLag) resolved: MySQL instance db2139:13313 has too large replication lag (14m 17s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:58:32] <Amir1>	 marostegui: morning, T312863 is only left on codfw master of s1, I assume we need to do switchover for the reboot as well. Is it fine I just do it? no user impact given codfw is read-only anyway
[06:58:32] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:43:03] <marostegui>	 Amir1: Sure, I was planning to take care of that today or tomorrow, but up to you :)
[08:43:40] <Amir1>	 sure, let me see what I can do. 
[08:44:09] <marostegui>	 I can take care of that
[08:44:15] <marostegui>	 You have plenty of stuff :)
[08:47:10] <jynus>	 db1132 seems to be shot of space:  	DISK WARNING - free space: / 3216 MB (9% inode=97%): /tmp 3216 MB (9% inode=97%): /var/tmp 3216 MB (9% inode=97%):
[08:47:40] <marostegui>	 uh...interesting
[08:47:42] <marostegui>	 I will take care of that
[08:47:45] <marostegui>	 Thanks for the heads up
[08:48:02] <jynus>	 sorry for pinging on a warning, but better doing it early than later
[08:48:05] <marostegui>	 fixed :)
[08:48:13] <marostegui>	 it was the core generates for the debugging 
[08:50:34] <jynus>	 there are a few other disk alerts that could be relevant to this team: ms-be2035, aqs1004
[08:52:12] * Emperor just put in a CR to remove ms-be20[28-39] from the rings :)
[08:55:49] <Amir1>	 marostegui: Thanks <3
[08:56:45] <Emperor>	 ms-be2035 is presumably T314509 which is being "resolved" by shuffling it off its mortal coil
[08:56:45] <stashbot>	 T314509: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509
[08:56:46] <jynus>	 while debugging what finally was a train issue, I ran into a long running source of mediawiki replication errors, FYI: https://logstash.wikimedia.org/goto/a822d07342bb4833ada55f3501ae1f67
[09:23:17] <jynus>	 but if you see no errors that looks as working as expected
[09:27:12] <jynus>	 sorry, ignore that last line, it was for other channel
[09:28:11] <jynus>	 I will file a bug for Chronology protector unless you see trivial error at DB layer
[10:02:47] <jynus>	 URGENT BUG: percentage sign in backup monitoring alert needs to be prefixed by a non-breakable space to avoid "-7.0\n%"
[10:22:57] <jynus>	 I filed https://phabricator.wikimedia.org/T317625
[10:50:28] <Amir1>	 marostegui: okay if I run the schema change on the old s1 codfw master?
[10:51:10] <marostegui>	 Amir1: not yet
[10:54:42] <Amir1>	 let me know
[11:07:31] <marostegui>	 Amir1: db2103 is fully yours (it is pooled)
[11:07:39] <Amir1>	 Thanks!
[11:07:53] <marostegui>	 We have again a bunch of hosts depooled, are they meant to be depooled?
[11:08:11] <Amir1>	 marostegui: we have a lot of schema changes going on
[11:08:19] <marostegui>	 ok as long as they are under control, that's ok
[11:08:29] <Amir1>	 noting that I now run schema changes in parallel in one section too (one per dc)
[11:08:44] <marostegui>	 yeah, but we have 2 from s7 in codfw for instance
[11:09:12] <marostegui>	 Ah no, they are from eqiad and codfw
[11:09:21] <Amir1>	 yeah, that's expected
[11:17:50] <Amir1>	 I love how the templatelinks drop in s4 is like a sharp hammer dropping 210GB: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db1147&var-datasource=thanos&var-cluster=wmcs&from=now-24h&to=now 
[11:17:50] <Amir1>	 vs in s3 it's slow decrease because it's just lots of small wikis https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db2109&var-datasource=thanos&var-cluster=wmcs&from=now-12h&to=now
[11:17:50] <Amir1>	 still 100GB in s3 is gone too
[12:27:11] <marostegui>	 I am going to reboot db2093, which is zarcillo's slave
[12:27:18] <marostegui>	 I don't think it is used for anything but just fyi
[12:27:37] <marostegui>	 It used to be orchestrator master,but that's no longer the case
[14:45:55] <jynus>	 so to recap- I was matching the last alert with the fact that some misc read only alerts were paging, but maybe shouldn't, I saw it some weeks ago but forgot to ask
[21:05:32] <jinxer-wm>	 (MysqlReplicationLag) firing: (2) MySQL instance db1139:13311 has too large replication lag (11m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:10:32] <jinxer-wm>	 (MysqlReplicationLag) resolved: (2) MySQL instance db1139:13311 has too large replication lag (11m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:20:32] <jinxer-wm>	 (MysqlReplicationLag) firing: (3) MySQL instance db1139:13311 has too large replication lag (21m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[21:25:32] <jinxer-wm>	 (MysqlReplicationLag) resolved: (2) MySQL instance db1150:13314 has too large replication lag (9m 50s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag