[01:09:59] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 8.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:10:21] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:53] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:15] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:14:01] jynus: Regarding https://phabricator.wikimedia.org/T331510 I forgot there was a time change... so when I wrote 9AM UTC I really meant 10AM Spanish time :)
[05:14:05] Is that still ok with you?
[07:01:16] marostegui: just let me know in advance when you want to do it and I will stop the services
[07:01:28] jynus: In 1h :)
[07:01:57] If you could review this too, that'd be helpful: https://gerrit.wikimedia.org/r/c/operations/puppet/+/902572
[07:02:17] yes, I had a look at it last week but got distracted
[07:02:24] no problem :)
[07:22:06] bacula, however, completed its daily backups
[07:26:40] one thing I see is that puppet says binlog_format: ROW, but I see binlog_format MIXED in both
[07:26:51] yeah, the usual thing
[07:31:04] I see you already moved the topology
[07:32:13] yeah
[07:34:13] jynus: if everything on your side is good to go, I can do the failover now, no need to wait any longer
[07:34:17] I will shut down bacula, as I don't expect any backup or recovery in the next 30 minutes
[07:34:22] sweet
[07:34:27] let me know when ready
[07:34:30] marostegui: I would prefer if we waited a bit
[07:34:38] absolutely, let me know
[07:34:40] as dbs are running a bit late
[07:34:46] no problem
[07:35:06] and even if they have gathered almost all metadata, that way it requires no manual intervention from me (saves me time)
[07:37:21] one thing I can do is reload the configuration more frequently for dbbackups, or when a connection fails
[07:37:32] for next time
[07:41:09] I will meanwhile prepare another patch for bacula
[07:41:17] cool
[08:16:54] I need to make some changes on the dbbackups hosts and restart bacula
[08:17:04] cool thanks
[08:24:30] Amir1, jynus: can you double check if you are allowed to remove the -2 from here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/903182 as I won't be around :)
[08:24:57] if not, I will leave it with -1 so it can be removed/ignored on the day of the switchover
[08:25:08] I just did :P
[08:25:14] Great
[08:25:15] allowed as in, do I have permissions on gerrit?
[08:25:19] Leave the -2 again then
[08:25:33] yeah, I didn't know if it was allowed or not
[08:25:34] when should that happen?
[08:25:53] https://phabricator.wikimedia.org/T333123#8727976
[08:25:53] can you send an invite?
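An aside on the binlog_format mismatch flagged at 07:26 ("puppet says ROW, but I see MIXED"): the running value can be compared against the puppet-declared one directly on the replica. A minimal sketch, assuming the pymysql client; the host name and credentials are placeholders, not the actual production access method:

```python
# Minimal sketch: compare the live binlog_format on a replica with the value
# puppet is supposed to enforce. Host and credentials are placeholders.
import pymysql

EXPECTED = "ROW"  # what puppet declares for this host

conn = pymysql.connect(host="db2132.example", user="check", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'binlog_format'")
        _, actual = cur.fetchone()  # row looks like ('binlog_format', 'MIXED')
finally:
    conn.close()

if actual != EXPECTED:
    print(f"mismatch: server runs {actual}, puppet declares {EXPECTED}")
```

Since binlog_format is a dynamic variable, a mismatch like this usually just means the running server predates the config change: puppet updates the config file, but the global only picks it up at restart (or via an explicit SET GLOBAL), which would fit the "yeah, the usual thing" reaction above.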
[08:25:57] I will give more details on the meeting [08:26:04] ok, that works too [08:26:29] as maintenance could be rescheduled or something [08:26:31] But essentially whenever you and Amir1 want after the row B maintenance and before 4th april (row C maintenance). But I won't be around those days :) [08:27:30] dbbackups are ok now, restarting bacula [08:29:47] I need to restart the monitoring daemons, they got in a weird state [08:30:07] and retun gerrit backups [08:32:42] I'm not sure also if I did https://gerrit.wikimedia.org/r/c/operations/puppet/+/903179/1/modules/profile/files/backup/job_monitoring_ignorelist right [08:36:27] ok, back to normal state, finally: "RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs" [08:37:03] and gerrit backup ran correctly, all done regarding maintenance, marostegui [08:37:10] thanks jynus! [09:02:28] I've sped up swift backups at codfw, but noticed a spike on 503s at swift (previous to my change) [09:11:24] * Emperor tempted to depool ms-fe2009 [09:18:23] marostegui: okay if I do switchovers of s4 and s7 in eqiad? [09:58:44] Amir1: Sure, up to you. Remember that eqiad is now active for reads [09:59:01] yeah, it's now basically codfw in ordinary days [09:59:10] yep [10:03:38] !log depool ms-fe2009 [10:03:39] Emperor: Not expecting to hear !log here [10:03:41] bah [12:02:17] Amir1: are you joining the meeting? [12:02:27] right now [12:02:30] thanks [12:08:35] marostegui: my fault! [12:24:57] * marostegui stares at jynus [12:46:31] (SystemdUnitFailed) firing: swift_rclone_sync.service Failed on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:32] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:46] ^ fixing that [12:56:31] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:31] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:11] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:01] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
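The SystemdUnitFailed alerts above come from the check_systemd_state check linked in each alert, which reports units in systemd's "failed" state. One way to see exactly what it is seeing on an affected host is to list the failed units locally. A minimal sketch, assuming shell access to the host (e.g. db1101); it uses only standard systemctl flags:

```python
# Minimal sketch: list the units currently in the "failed" state on this
# host, i.e. the set the SystemdUnitFailed check is reporting on.
import subprocess

out = subprocess.run(
    ["systemctl", "list-units", "--failed", "--no-legend", "--plain"],
    capture_output=True, text=True, check=True,  # list-units exits 0 even when units are failed
).stdout

for line in out.splitlines():
    unit = line.split()[0]  # first column is the unit name
    print("failed:", unit)
```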
[16:54:11] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:27] I don't get those alerts. The systemd unit is disabled and reset
[20:56:01] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:29:11] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:40:36] sigh, I'm about to cry
[22:30:40] are these new alerts set up during sprint week? Should we turn them down a bit?
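One possible explanation for the "disabled and reset, yet still firing" situation above: the alert groups several hosts (note the "(2)" / "(3)" counts), so the failed state may still linger on a host other than the one that was reset. A minimal sketch of clearing and verifying the state on one host, assuming root shell access; the unit name is taken from the alert text:

```python
# Minimal sketch: clear a unit's "failed" state so it drops out of
# `systemctl --failed`, then verify. Unit name from the alert above.
import subprocess

UNIT = "wmf_auto_restart_prometheus-mysqld-exporter@s7.service"

# reset-failed removes the unit from systemd's failed set
subprocess.run(["systemctl", "reset-failed", UNIT], check=True)

# is-failed prints the state and exits non-zero once the unit is no
# longer failed, so do not use check=True here
state = subprocess.run(["systemctl", "is-failed", UNIT],
                       capture_output=True, text=True)
print(state.stdout.strip())  # e.g. "inactive" once cleared
```

Even then, the alert only resolves once the check next scrapes the host, which would explain some of the lag between a fix and the RECOVERY notification.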