[00:08:49] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:04] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:02:09] I'm looking into why the os_reports.service is failing.
For some reason a number of hosts no longer have system::role in PuppetDB
[10:02:46] It currently fails on apt2001, but I think that might simply be because that's the first host in the loop
[10:02:48] that's expected, I need to update the script to cover the new scheme
[10:03:21] system::role got obsoleted by the new scheme where we simply pass the description values in Hiera along with the rest of the role settings
[10:03:54] and half of IF services have been converted to that scheme, that's why we're only seeing this for some hosts
[10:04:31] Okay, we can do a quick fix that just skips the hosts that have been migrated, or not care :-)
[10:05:52] yeah, this only needs to be fixed in the script which generates the role reports
[10:06:07] we can also simply silence the alert for a week or so, until I've fixed that
[10:06:20] I'll do that then
[10:09:46] ack, thx
[10:11:25] Probably also need to find a way to differentiate systemd alerts, that's not exactly a critical alert
[10:15:43] 10netops, 10Infrastructure-Foundations, 10sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389 (10LSobanski)
[10:57:55] full ack
[12:09:47] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:50] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:30:23] godog: I fail to understand the difference between the above two messages... what is AM trying to tell us?
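The missing system::role resources mentioned at 10:02 can be checked against PuppetDB directly. A dry-run sketch using the standard PuppetDB v4 query API; the host and port are assumptions (not confirmed by the log), and the command is echoed rather than executed:

```shell
# Dry-run sketch: list hosts still declaring system::role via the
# standard PuppetDB v4 resources endpoint. The hostname is an
# assumption; echo keeps this from actually querying anything.
echo curl -sG 'https://puppetdb2003.codfw.wmnet/pdb/query/v4/resources' \
  --data-urlencode 'query=["=", "type", "System::Role"]'
```

Hosts absent from that result set would be the ones already converted to the new Hiera-based scheme.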
[12:50:13] also the units seem to be OK to me
[13:11:38] volans: interesting, yeah I'm not sure offhand
[13:11:40] I think one is the firing, and the other is the "resolve"
[13:12:18] both with "firing: (3)" is unusual though
[13:12:19] I've seen a few of these before, I think XioNoX pointed out that it seems to happen when there's a count, in this case 3
[13:13:35] but the count didn't change
[13:14:15] unless that prints also if 2 were added and another 2 were removed, keeping the total at 3
[13:14:32] but also, why is it alerting?
[13:14:48] the linked dashboard on grafana is not helpful
[13:15:06] I'm not sure offhand why puppetmaster2001 and not puppetserver200[12] like on alerts.w.o
[13:15:32] I agree on the dashboard, it needs improving for sure
[13:16:03] The service failed, so it should alert
[13:16:19] not on puppetmasters, I checked before
[13:16:44] Oh, no, I'm on puppetserver
[13:16:57] there it failed because:
[13:16:58] ERROR:root:GCP: Expecting value: line 1 column 1 (char 0)
[13:17:23] Google probably deprecated something.
[13:17:42] or just failed to download it temporarily
[13:17:54] that should also go to serviceops I think
[13:20:28] since it works now
[13:20:59] also I'm not sure why it's running on all puppetmasters/servers, isn't it committing to the private repo?
[13:22:52] Do we know what consumes the list of IPs?
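The GCP error quoted at 13:16:58 is worth decoding: "Expecting value: line 1 column 1 (char 0)" is what Python's json module raises when the body it is asked to parse is not JSON at all. An empty response or an HTML error page would both produce it, which fits the "failed to download it temporarily" theory. A minimal reproduction:

```shell
# Reproduce the decoder error from the log: feeding an empty string
# to Python's json module fails at the very first character.
python3 -c 'import json; json.loads("")' 2>&1 | tail -n 1
# -> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```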
[13:23:49] (SystemdUnitFailed) firing: (2) dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:25] Well, at least the count went down
[13:24:26] requestctl
[13:24:43] Ah, makes sense, thanks
[13:25:06] sorry, this specific timer saves it to volatile
[13:26:44] ahhh got it
[13:26:57] only on puppetmaster1001 does it get run with -c, which actually commits to the private repo
[13:28:30] still weird to me that it writes to volatile on all hosts
[13:28:39] but that's digressing
[13:30:49] re: systemd-status dashboard, I've added a followup to T332764
[13:30:50] T332764: Port base host checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T332764
[13:58:07] there's a puppetserver ca.conf change that doesn't require a restart (uncommenting a default value), anything I need to do/know at this point re: restarting puppetserver? or OK to leave it be?
[14:00:08] for puppetserver restarts we currently disable Puppet fleet-wide, it's simpler than moving servers out of the service records (except for reimages and longer-term maintenance)
[14:00:55] let's maybe restart one server (like 1003 or so) to ensure that there are no regressions, and then the rest can happen organically via future restarts?
[14:04:56] moritzm: yeah, 1003 sounds good to me
[14:06:01] is there a doc and/or cookbook I could follow, or in the case of a single server can I just restart it?
[14:13:55] restarting directly is also fine, the restart takes a bit and there will be some Puppet failures, but that's okay. or otherwise disable Puppet fleet-wide using cumin 'A:all' "disable-puppet 'puppetserver1003 restart'"
[14:14:15] but it's fine to just go ahead, every agent retries 30 mins later anyway
[14:17:56] ack, thanks moritzm!
will restart now
[14:18:06] sounds good, thanks
[15:59:00] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10klausman)
[16:02:36] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10Jhancock.wm) Forgot to update earlier. Rack is physically ready
[16:06:00] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=349240a0-30c3-4371-9418-7f1f46072237) set by cmooney@cumin1...
[16:10:34] 10Mail, 10Infrastructure-Foundations, 10Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (10RoySmith) This is really frustrating. I've apparently been added to the email thread for the zendesk ticket, but I can't access th...
[16:21:37] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10Jhancock.wm) rack is physically ready for tomorrow.
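The fleet-wide disable flow moritzm described at 14:00 and 14:13 can be sketched as a dry run. disable-puppet, the 'A:all' alias, and the reason string come from the log; enable-puppet as the counterpart, the target FQDN, and the systemd unit name are assumptions, and every command is echoed rather than executed:

```shell
# Dry-run sketch of the fleet-wide Puppet disable/restart/enable flow
# described in the log. Commands are echoed, not run; enable-puppet,
# the FQDN, and the unit name are assumptions, not confirmed by the log.
REASON='puppetserver1003 restart'
echo sudo cumin 'A:all' "disable-puppet '${REASON}'"
echo sudo cumin 'puppetserver1003.eqiad.wmnet' 'systemctl restart puppetserver'
echo sudo cumin 'A:all' "enable-puppet '${REASON}'"
```

Per 14:14, skipping the disable step is also acceptable: agents that hit the server mid-restart simply retry 30 minutes later.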
[16:49:10] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) All work completed, no issues to report :)
[16:52:15] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10MatthewVernon) Swift looks happy, thanks :)
[17:09:37] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) >>! In T355863#9538876, @MatthewVernon wrote: > Swift looks happy, thanks :) great, thanks for the update!
[17:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:03] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall)
[17:41:24] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall)
[17:47:46] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) One more data point: note that `gnt-instance console FQDN` is broken because of T309724 so we don't know the exact failure.
[17:54:00] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is c...
[18:04:09] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots...
[18:47:17] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall) Thanks for the response, @Volans >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, i...
[19:35:59] 10Mail, 10Infrastructure-Foundations, 10Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (10Urbanecm) >>! In T338032#9530131, @Dzahn wrote: >>>! In T338032#9530072, @RoySmith wrote: >> Is there some way I can track those ze...
[21:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:56:38] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma...
[21:57:25] 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, 10GitLab (Auth & Access), and 2 others: Add GitLab to offboarding workflow - https://phabricator.wikimedia.org/T339843 (10brennen)
[22:57:40] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) >>! In T357449#9540286, @Volans wrote: > 2) If you run `sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet` it's very easy to see the co...
[23:04:30] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) > just a normal d-i partman configuration issue The code change from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674/3/modules/profil...
[23:10:32] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) @Dzahn you can get a working console either setting the known hosts files to /dev/null and the strict checking to no in the ssh command running it...
[23:44:25] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @Volans Ah, yes, i can get a console when running `sudo /usr/lib/ganeti/tools/kvm-console-wrapper /usr/bin/socat ncmonitor1001.eqiad.wmnet /var/run/...
[23:48:10] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @BCornwall .. but then after installing the base system it fails at installing grub in /dev/sda.. which is not expected.
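The console workaround Volans describes at 23:10 (known hosts file to /dev/null, strict checking off) corresponds to two standard OpenSSH options. A dry-run sketch; the target host comes from the log, the exact invocation the console wrapper builds is not shown there, and the command is echoed rather than executed:

```shell
# Dry-run sketch of the host-key workaround from the log: standard
# OpenSSH flags that ignore known_hosts and skip strict checking, so
# the console's ssh hop doesn't fail on a reinstalled VM's changed key.
echo ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
  ncmonitor1001.eqiad.wmnet
```

This trades away host-key verification for that one connection, which is acceptable for a serial console hop to a VM that was just reimaged.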