[08:50:28] I am doing an IO stress test on backup2010 - if I go overboard and crash it/fill up its disks don't worry, it is not currently pooled on any service
[09:09:34] s5 eqiad snapshot wrong_size 4 hours ago 490.9 GB -11.5 % The previous backup had a size of 554.9 GB, a change larger than 5.0%.
[09:11:13] going for a coffee while I keep hitting backup2010
[09:32:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:48] jynus: maybe downtiming that host would be good while you stress it?
[09:52:25] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:37] Emperor: it had all alerts disabled
[10:05:22] But I don't even know where alarms like that come from ^
[10:10:00] that's the catch-all alertmanager alert for a systemd unit failing
[10:11:03] if you go to https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed you can see them (and I think silence them for this host for a suitable period)
[10:12:25] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:11] I will just restart the host to make sure it boots correctly
[10:17:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on backup2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:23] change_tag in wikidata is such a pain to operate with: Query OK, 940052169 rows affected (9 hours 36 min 53.983 sec)
[15:50:03] I'm doing a huge deletion on backup1-eqiad too, ignore any lag (despite downtiming alerts there)
[15:50:45] marostegui: I tried to put the flows we discussed yesterday in a diagram, could you please do a quick review? https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Overview_diagram
[15:58:31] dhinus: I am about to log off for the day, I will get back to you tomorrow
[16:02:52] no rush!
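
Note on the 09:09:34 wrong_size row: the -11.5 % figure follows from the two sizes quoted there (490.9 GB now, 554.9 GB before), against the 5.0% threshold it mentions. Below is a minimal sketch of that arithmetic. The function names and the strict "larger than" comparison are assumptions made for illustration, not the actual backup check code.

```python
# Hypothetical helpers; names are illustrative, not taken from the real backup tooling.

def size_change_percent(previous_gb: float, current_gb: float) -> float:
    """Relative size change of the current backup versus the previous one, in percent."""
    return (current_gb - previous_gb) / previous_gb * 100.0


def is_wrong_size(previous_gb: float, current_gb: float,
                  threshold_percent: float = 5.0) -> bool:
    """True when the size changed by more than the threshold, in either direction.

    Assumes a strict comparison, based on the wording "a change larger than 5.0%".
    """
    return abs(size_change_percent(previous_gb, current_gb)) > threshold_percent


if __name__ == "__main__":
    # Sizes quoted in the s5 eqiad snapshot row.
    change = size_change_percent(554.9, 490.9)
    print(f"{change:.1f} %")             # -11.5 %
    print(is_wrong_size(554.9, 490.9))   # True
```

A shrink of that size after the large change_tag deletion mentioned at 15:17:23 would be plausible, but the log itself does not confirm the cause.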