[01:28:31] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:46] jbond: Were you thinking of something like this for the urldownloader hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/905040 My thinking is that we expand the disks by 10GB and just reimage the hosts
[09:12:12] slyngs: I'd suggested adding a new disk as that's the suggestion on https://wikitech.wikimedia.org/wiki/Ganeti#Adding_disk_space. However, if you are going to expand the current disk then that's also fine, but I wouldn't bother with the dedicated var mount. Better to keep it simple and not have yet another partman file to maintain
[09:12:29] fyi you can also just resize the disk and grow the fs without the reimage
[09:14:03] hmm except the default is hardcoded to have 10G for root and not the rest of the disk
[09:14:22] If you add a new disk, say sdb, doesn't that need to be in partman so it knows what to do on the next reimaging?
[09:15:31] slyngs: yes, if you add a new disk it would need to go into partman, but if you just expand the current disk to 20G and don't worry about an additional var mount (i.e. / 20GB) then I had hoped you could reuse the current partman recipe
[09:15:53] but looking at flat currently I don't think that's the case
[09:16:15] But then as you point out the partman would need -1 for the "rest of the disk"
[09:18:07] slyngs: yes exactly, I think that would be a better change than creating a new partman just for this box e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/905160
[09:19:15] Makes sense, that would be useful to more people
[09:20:12] So update partman to use "the rest of the disk", add 10GB of disk to urldownloader hosts and reimage them
[09:20:31] sgtm
[09:53:13] slyngs: cr is merged and deployed
[09:54:16] jbond: Thanks, I'll grab one of the inactive urldownloaders and give it a go a bit later
[09:54:30] cheers
[09:55:13] I need to go feed the bees :-)
[11:33:15] jbond: Do we know when we're switching over to the new bullseye urldownloaders? They already have enough disk
[11:36:58] slyngs: not sure, seems moritz is working on it (T329945), they are back on Wednesday
[11:36:58] T329945: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945
[11:39:01] jbond: I'll wait with doing anything until tomorrow then, no point in spending time if the buster hosts go away in a week or so.
[11:39:07] Thanks
[11:39:09] +1
[12:58:31] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:50] (SystemdUnitFailed) resolved: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:31] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:02] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:10:32] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) I'm wondering what kind of scenario are we trying to check for here? Mostly if a device...
[17:29:11] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) thanks for the comments > I'm wondering what kind of scenario are we trying to check for...
[18:25:39] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) >>! In T333007#8751620, @jbond wrote: > The script is not checking for this (and as you...
[18:55:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:56:21] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:25:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:26:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:10:36] Mail, Infrastructure-Foundations, MediaWiki-extensions-TranslationNotifications, serviceops: Investigate if TranslationNotification's DigestEmailer.php is really sending emails and what happens to them - https://phabricator.wikimedia.org/T333899 (MarcoAurelio)
[23:26:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed