[01:08:11] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:08:33] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 16.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:23] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:11:31] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 23 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:57] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:15:19] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:55:06] jynus: I have merged and upgraded db1225 https://gerrit.wikimedia.org/r/c/operations/puppet/+/942652
[05:56:24] Amir1: The switchmaster tool is giving 500s, could you double check? Thanks!
[05:57:36] Sure. Give me a min
[05:58:31] Np
[05:58:34] I am sanitizing the new wikis
[06:02:11] marostegui: It's because it can't find candidate master of s6 in puppet
[06:02:31] how come?
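The lag alerts at the top of the log compare the measured lag against warning/critical thresholds: "16.8 ge 2" means 16.8 s of lag is ≥ the 2 s critical threshold, and a RECOVERY line like "(C)2 ge (W)1 ge 0" shows the value back below both thresholds. A minimal sketch of that comparison (the function name and defaults are illustrative, not the actual check implementation):

```python
def classify_lag(lag_seconds, warn=1.0, crit=2.0):
    """Classify replication lag against warning/critical thresholds,
    mirroring the 'value ge threshold' comparison in the alerts above."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# e.g. the db1217 alert: 16.8 ge 2 -> CRITICAL; its recovery at 0 -> OK
```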
[06:02:42] # db1173
[06:02:42] # candidate master for s6
[06:05:13] hmm, from what I'm seeing, it should find it, let me dig
[06:07:33] marostegui: it worked now, I didn't do anything :D
[06:07:41] XD
[06:07:44] classic
[06:07:47] Thanks :)
[06:08:13] if it's not pooled in during the call, it might cause this issue btw
[06:08:23] ah maybe
[06:08:29] that could have been it
[06:31:05] s4 eqiad snapshot wrong_size 25 minutes ago 1.6 TB -11.0 % The previous backup had a size of 1.8 TB, a change larger than 5.0%.
[06:31:15] -201.5GB
[07:17:13] https://logstash.wikimedia.org/goto/c4a3338667b1b2bc4fd236dab2e837bf
[07:17:19] is it me or has the graph stopped drawing?
[07:20:03] maybe it is just me, but as long as you don't expect it to write future queries it looks fine to me :-D
[07:20:40] sometimes there is lag, however
[07:21:11] oh hahaha I didn't realise I put "from now" XD
[07:21:39] ha ha
[07:22:25] don't worry, we will ask arnaudb to write the feature to write all future queries :-D
[07:23:05] 😱
[07:23:34] I mean- logging old queries is easy
[07:25:40] Step 1: set up a new Kibana dashboard on a server. Step 2: accelerate the server beyond the speed of light. 😜
[07:53:37] :smi
[07:53:52] 😄 I'm not used to irccloud's autocomplete :D
[08:49:02] https://phabricator.wikimedia.org/P52022 sigh (disk swap, then puppet merrily did the wrong thing, which I will now fix)
[09:19:49] marostegui: I don't remember if we did x1 reboots automatically too. Any objection to running the script there as well?
[09:19:57] works for me
[09:20:07] awesome
[14:10:13] why is db1218 depooled?
[14:15:08] last update seems to be https://phabricator.wikimedia.org/P49603 so probably hw servicing and forgotten afterwards?
[14:16:34] According to SAL it looks like a maintenance from 28th Aug
[14:16:38] Amir1: was that a kernel reboot?
[14:17:07] oh, then I was wrong
[14:17:26] let me check
[14:18:53] I haven't depooled it explicitly
[14:19:05] at least not according to my bash history
[14:19:20] 22:53 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[14:19:44] that can be either the extlinks clean up or the reboots
[14:19:57] both downtime for a day and both were run there recently
[14:20:02] either way, could you double check if it still needs to be depooled?
[14:20:54] yeah, it was extlinks
[14:36:49] marostegui: I'm running compare stuff in a screen on s4 of codfw. I need to be afk for a bit. If there are slowdowns or such, feel free to kill it
[14:37:21] it's one replica to another replica, so it should be quite safe, but you never know
[14:37:36] ok
[15:39:24] db1201 just went down
[15:39:25] I have depooled it
[15:39:44] that was quick, I was just coming from the p.age :)
[15:40:52] https://phabricator.wikimedia.org/T345271
[15:43:38] I think it might be network related, the host is reachable via idrac but not from anywhere else
[15:43:43] Going to tag eqiad DCOps
[15:44:46] [167720.577345] tg3 0000:04:00.0 eno1: Link is down
[15:44:48] Yep
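The 06:31 wrong_size alert earlier in the log fires when a snapshot's size differs from the previous one by more than 5.0 %. A back-of-the-envelope check of the quoted numbers (the function is a sketch, not the actual backup-check code; the alert's -11.0 % differs slightly from the exact figure because the displayed sizes are rounded):

```python
def size_change_pct(previous_bytes, current_bytes):
    """Percentage change of the new backup relative to the previous one."""
    return (current_bytes - previous_bytes) / previous_bytes * 100.0

prev = 1.8e12               # previous s4 snapshot: 1.8 TB
cur = prev - 201.5e9        # the quoted delta: -201.5 GB, i.e. ~1.6 TB
change = size_change_pct(prev, cur)   # about -11.2 %
fires = abs(change) > 5.0             # exceeds the 5.0 % threshold
```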