[03:48:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:17] Puppet, Infrastructure-Foundations, SRE, cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (Aklapper) a:Paladox→None @Paladox: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cl...
[09:18:03] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[10:18:51] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (ops-monitoring-bot) Host rebooted by jelto@cumin1001 with reason: None
[11:02:10] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:02:43] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:48:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:00] jbond: this is probably another service unit that we could forward to service ops :D
[11:49:03] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:50:11] volans: i wasn't sure if that was handled by serviceops or us aka moritzm ?
[11:51:48] I *think* that's a job j.oe added/worked on recently but I might be mistaken
[11:51:57] I can check puppet blame later
[11:51:59] I don't think we really own this, maybe the base images, but the production images surely not
[11:52:46] it won't hurt to alert here, but it surely should also alert serviceops (and serviceops-collab?)
[11:53:18] or maybe rather all SRE teams actually, if there's e.g. an issue with the Java images, that can affect practically all SRE sub teams
[11:53:27] im not sure the logic is there to alert multiple teams yet, at least i don't know how to do that so it would be either or at this point
[11:54:13] hmm im not sure we can do all either. to do that we would basically need to remove the team tag
[11:54:45] but im also not sure how to easily do that either (more due to the overloaded logic already present)
[11:55:41] ultimately though we should alert whoever is going to fix the issue not necessarily everyone who may be affected
[12:00:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/976198 will update it to serviceops either way
[12:32:43] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:23:32] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond)
[13:24:12] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:39:00] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Jelto) @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks fine so far. Could you double check `ger...
[15:15:51] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[15:48:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:50] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[15:59:09] Puppet, Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (Jdlrobson)
[16:13:20] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (hashar) >>! In T349619#9348943, @Jelto wrote: > @hashar `gerrit2002` was migrated to puppet7. I restarted gerrit and apache processes and the instance looks...
[16:53:23] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:23] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:10:43] Puppet, Infrastructure-Foundations, SRE, cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (Dzahn) Open→Resolved a:Dzahn I am going to be bold and call it resolved. Based on my previous comments. We created a Hiera k...
[17:10:51] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (Dzahn)
[17:38:23] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:23] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:23] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed