[01:09:59] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 8.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:10:21] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:53] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:15] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:14:01] <marostegui>	 jynus: Regarding https://phabricator.wikimedia.org/T331510 I forgot there was a time change...so when I wrote 9AM UTC I really meant 10AM Spanish time :)
[05:14:05] <marostegui>	 Is that still ok with you?
[07:01:16] <jynus>	 marostegui: just let me know in advance when you want to do it and I will stop the services
[07:01:28] <marostegui>	 jynus: In 1h :)
[07:01:57] <marostegui>	 If you could review this too https://gerrit.wikimedia.org/r/c/operations/puppet/+/902572 that'd be helpful!
[07:02:17] <jynus>	 yes, I had a look at it last week but got distracted
[07:02:24] <marostegui>	 no problem :)
[07:22:06] <jynus>	 bacula, however, completed its daily backups
[07:26:40] <jynus>	 one thing I see is that puppet says binlog_format: ROW, but I see binlog_format MIXED in both
[07:26:51] <marostegui>	 yeah, the usual thing
[07:31:04] <jynus>	 I see you already moved the topology
[07:32:13] <marostegui>	 yeah
[07:34:13] <marostegui>	 jynus: if everything from your side is good to go, I can do the failover now, no need to wait anylonger
[07:34:17] <jynus>	 I will shutdown bacula, as I don't expect any backup or recovery in the next 30 minutes
[07:34:22] <marostegui>	 sweet
[07:34:27] <marostegui>	 let me know when ready
[07:34:30] <jynus>	 marostegui: I would prefer if we waited a bit
[07:34:38] <marostegui>	 absolutely, let me know
[07:34:40] <jynus>	 as dbs are running a bit late 
[07:34:46] <marostegui>	 no problem
[07:35:06] <jynus>	 and even if they have gathered almost all metadata, that way it require no manual intervention from me (saves me time)
[07:35:24] <jynus>	 *requires
[07:37:21] <jynus>	 one thing I can do is to reload configuration more frequently for dbbackups or if connection fails
[07:37:32] <jynus>	 for next time
[07:41:09] <jynus>	 I will meanwhile prepare another patch for bacula
[07:41:17] <marostegui>	 cool
[08:16:54] <jynus>	 I need to do some changes on dbbackups hosts and restart bacula
[08:17:04] <marostegui>	 cool thanks
[08:24:30] <marostegui>	 Amir1 jynus can you double check if you are allowed to remove the -2 from here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/903182 as I won't be around :)
[08:24:57] <marostegui>	 if not, I will leave it with -1 so it can be removed/ignored the day of the switchover
[08:25:08] <Amir1>	 I just did :P
[08:25:14] <marostegui>	 Great
[08:25:15] <jynus>	 allowed as in, if I have permissions on gerrit?
[08:25:19] <marostegui>	 Leave the -2 again then
[08:25:33] <marostegui>	 yeah, I didn't know if it was allowed or not
[08:25:34] <jynus>	 when should that happen?
[08:25:53] <marostegui>	 https://phabricator.wikimedia.org/T333123#8727976
[08:25:53] <jynus>	 can you send an invite?
[08:25:57] <marostegui>	 I will give more details on the meeting
[08:26:04] <jynus>	 ok, that works too
[08:26:29] <jynus>	 as maintenance could be rescheduled or something
[08:26:31] <marostegui>	 But essentially whenever you and Amir1 want after the row B maintenance and before 4th april (row C maintenance). But I won't be around those days :)
[08:27:30] <jynus>	 dbbackups are ok now, restarting bacula
[08:29:47] <jynus>	 I need to restart the monitoring daemons, they got in a weird state
[08:30:07] <jynus>	 and retun gerrit backups
[08:32:42] <jynus>	 I'm not sure also if I did https://gerrit.wikimedia.org/r/c/operations/puppet/+/903179/1/modules/profile/files/backup/job_monitoring_ignorelist right
[08:36:27] <jynus>	 ok, back to normal state, finally: "RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs"
[08:37:03] <jynus>	 and gerrit backup ran correctly, all done regarding maintenance, marostegui
[08:37:10] <marostegui>	 thanks jynus!
[09:02:28] <jynus>	 I've sped up swift backups at codfw, but noticed a spike on 503s at swift  (previous to my change)
[09:11:24] * Emperor tempted to depool ms-fe2009
[09:18:23] <Amir1>	 marostegui: okay if I do switchovers of s4 and s7 in eqiad?
[09:58:44] <marostegui>	 Amir1: Sure, up to you. Remember that eqiad is now active for reads
[09:59:01] <Amir1>	 yeah, it's now basically codfw in ordinary days
[09:59:10] <marostegui>	 yep
[10:03:38] <Emperor>	 !log depool ms-fe2009
[10:03:39] <stashbot>	 Emperor: Not expecting to hear !log here
[10:03:41] <Emperor>	 bah
[12:02:17] <marostegui>	 Amir1: are you joining the meeting?
[12:02:27] <Amir1>	 right now
[12:02:30] <marostegui>	 thanks
[12:08:35] <jynus>	 marostegui: my fault!
[12:24:57] * marostegui stares at jynus 
[12:46:31] <jinxer-wm>	 (SystemdUnitFailed) firing: swift_rclone_sync.service Failed on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:51:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:51:46] <marostegui>	 ^ fixing that
[12:56:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:31] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:27] <marostegui>	 I don't get those alerts. The systemd unit is disabled and reset
[20:56:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:29:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:40:36] <Amir1>	 sigh, I'm about to cry
[22:30:40] <Emperor>	 are they new alerts set up during sprint week? Should we turn them down a bit?