[08:50:28] <jynus>	 I am doing an IO stress test on backup2010 - if I go overboard and crash it/fill up its disks don't worry, it is not currently pooled on any service
[09:09:34] <Amir1>	 s5 	eqiad 	snapshot 	wrong_size 	4 hours ago 	490.9 GB 	-11.5 % 	The previous backup had a size of 554.9 GB, a change larger than 5.0%. 
[09:11:13] <jynus>	 going for a coffee while I keep hitting backup2010
[09:32:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:48] <Emperor>	 jynus: maybe downtiming that host would be good while you stress it?
[09:52:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:37] <jynus>	 Emperor: it had all alerts disabled
[10:05:22] <jynus>	 But I don't even know where alarms like that come from ^
[10:10:00] <Emperor>	 that's the catch-all alertmanager alert for a systemd unit failing
[10:11:03] <Emperor>	 if you go to https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed you can see them (and I think silence them for this host for a suitable period)
[10:12:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:11] <jynus>	 I will just restart the host to make sure it boots correctly
[10:17:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:23] <marostegui>	 change_tag in wikidata is such a pain to operate with: Query OK, 940052169 rows affected (9 hours 36 min 53.983 sec)
[15:50:03] <jynus>	 I'm doing a huge deletion on backup1-eqiad too, ignore any lag (despite the alerts being downtimed there)
[15:50:45] <dhinus>	 marostegui: I tried to put the flows we discussed yesterday in a diagram, could you please do a quick review? https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Overview_diagram
[15:58:31] <marostegui>	 dhinus: I am about to log off for the day, I will get back to you tomorrow
[16:02:52] <dhinus>	 no rush!