[00:08:49] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:47] (SystemdUnitFailed) firing: (4) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:04] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:02:09] I'm looking into why the os_reports.service is failing.
For some reason a number of hosts no longer have system::role in PuppetDB
[10:02:46] It currently fails on apt2001, but I think that might simply be because that's the first host in the loop
[10:02:48] that's expected, I need to update the script to cover the new scheme
[10:03:21] system::role got obsoleted by the new scheme where we simply pass the description values in Hiera along with the rest of the role settings
[10:03:54] and half of IF services have been converted to that scheme, that's why we're only seeing this for some hosts
[10:04:31] Okay, we can do a quick fix that just skips the hosts that have been migrated, or not care :-)
[10:05:52] yeah, this only needs to be fixed in the script which generates the role reports
[10:06:07] we can also simply silence the alert for a week or so, until I've fixed that
[10:06:20] I'll do that then
[10:09:46] ack, thx
[10:11:25] Probably also need to find a way to differentiate systemd alerts, that's not exactly a critical alert
[10:15:43] 10netops, 10Infrastructure-Foundations, 10sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389 (10LSobanski)
[10:57:55] full ack
[12:09:47] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:50] (SystemdUnitFailed) firing: (3) dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:30:23] godog: I fail to understand the difference between the above two messages... what is AM trying to tell us?
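The missing system::role resources mentioned at 10:02 can be checked against PuppetDB directly. A dry-run sketch using the standard PuppetDB v4 query API; the host and port are assumptions (not confirmed by the log), and the command is echoed rather than executed:

```shell
# Dry-run sketch: list hosts still declaring system::role via the
# standard PuppetDB v4 resources endpoint. The hostname is an
# assumption; echo keeps this from actually querying anything.
echo curl -sG 'https://puppetdb2003.codfw.wmnet/pdb/query/v4/resources' \
  --data-urlencode 'query=["=", "type", "System::Role"]'
```

Hosts absent from that result set would be the ones already converted to the new Hiera-based scheme.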
[12:50:13] also the units seem to be OK to me
[13:11:38] volans: interesting, yeah I'm not sure offhand
[13:11:40] I think one is the firing, and the other is the "resolve"
[13:12:18] both with "firing: (3)" is unusual though
[13:12:19] I've seen a few of these before, I think XioNoX pointed out that it seems to happen when there's a count, in this case 3
[13:13:35] but the count didn't change
[13:14:15] unless that prints also if 2 were added and another 2 were removed, keeping the total at 3
[13:14:32] but also, why is it alerting?
[13:14:48] the linked dashboard on grafana is not helpful
[13:15:06] I'm not sure offhand why puppetmaster2001 and not puppetserver200[12] like on alerts.w.o
[13:15:32] I agree on the dashboard, it needs improving for sure
[13:16:03] The service failed, so it should alert
[13:16:19] not on puppetmasters, I checked before
[13:16:44] Oh, no, I'm on puppetserver
[13:16:57] there it failed because:
[13:16:58] ERROR:root:GCP: Expecting value: line 1 column 1 (char 0)
[13:17:23] Google probably deprecated something.
[13:17:42] or just failed to download it temporarily
[13:17:54] that should also go to serviceops I think
[13:20:28] since it works now
[13:20:59] also I'm not sure why it's running on all puppetmasters/servers, isn't it committing to the private repo?
[13:22:52] Do we know what consumes the list of IPs?
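The GCP error quoted at 13:16:58 is worth decoding: "Expecting value: line 1 column 1 (char 0)" is what Python's json module raises when the body it is asked to parse is not JSON at all. An empty response or an HTML error page would both produce it, which fits the "failed to download it temporarily" theory. A minimal reproduction:

```shell
# Reproduce the decoder error from the log: feeding an empty string
# to Python's json module fails at the very first character.
python3 -c 'import json; json.loads("")' 2>&1 | tail -n 1
# -> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```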
[13:23:49] (SystemdUnitFailed) firing: (2) dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:25] Well, at least the count went down
[13:24:26] requestctl
[13:24:43] Ah, makes sense, thanks
[13:25:06] sorry, this specific timer saves it to volatile
[13:26:44] ahhh got it
[13:26:57] only on puppetmaster1001 does it get run with -c, which actually commits to the private repo
[13:28:30] still weird to me that it writes to volatile on all hosts
[13:28:39] but that's digressing
[13:30:49] re: systemd-status dashboard, I've added a followup to T332764
[13:30:50] T332764: Port base host checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T332764
[13:58:07] there's a puppetserver ca.conf change that doesn't require a restart (uncommenting a default value), anything I need to do/know at this point re: restarting puppetserver? or OK to leave it be?
[14:00:08] for puppetserver restarts we currently disable Puppet fleet-wide, it's simpler than moving servers out of the service records (except for reimages and longer-term maintenance)
[14:00:55] let's maybe restart one server (like 1003 or so) to ensure that there are no regressions, and then the rest can happen organically via future restarts?
[14:04:56] moritzm: yeah, 1003 sounds good to me
[14:06:01] is there a doc and/or cookbook I could follow, or in the case of a single server can I just restart it?
[14:13:55] restarting directly is also fine, the restart takes a bit and there will be some Puppet failures, but that's okay. or otherwise disable Puppet fleet-wide using cumin 'A:all' "disable-puppet 'puppetserver1003 restart'"
[14:14:15] but it's fine to just go ahead, every agent retries 30 mins later anyway
[14:17:56] ack, thanks moritzm!
will restart now
[14:18:06] sounds good, thanks
[15:59:00] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10klausman)
[16:02:36] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10Jhancock.wm) Forgot to update earlier. Rack is physically ready
[16:06:00] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=349240a0-30c3-4371-9418-7f1f46072237) set by cmooney@cumin1...
[16:10:34] 10Mail, 10Infrastructure-Foundations, 10Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (10RoySmith) This is really frustrating. I've apparently been added to the email thread for the zendesk ticket, but I can't access th...
[16:21:37] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10Jhancock.wm) rack is physically ready for tomorrow.
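The fleet-wide disable flow moritzm described at 14:00 and 14:13 can be sketched as a dry run. disable-puppet, the 'A:all' alias, and the reason string come from the log; enable-puppet as the counterpart, the target FQDN, and the systemd unit name are assumptions, and every command is echoed rather than executed:

```shell
# Dry-run sketch of the fleet-wide Puppet disable/restart/enable flow
# described in the log. Commands are echoed, not run; enable-puppet,
# the FQDN, and the unit name are assumptions, not confirmed by the log.
REASON='puppetserver1003 restart'
echo sudo cumin 'A:all' "disable-puppet '${REASON}'"
echo sudo cumin 'puppetserver1003.eqiad.wmnet' 'systemctl restart puppetserver'
echo sudo cumin 'A:all' "enable-puppet '${REASON}'"
```

Per 14:14, skipping the disable step is also acceptable: agents that hit the server mid-restart simply retry 30 minutes later.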
[16:49:10] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) All work completed, no issues to report :)
[16:52:15] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10MatthewVernon) Swift looks happy, thanks :)
[17:09:37] 10netops, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) >>! In T355863#9538876, @MatthewVernon wrote: > Swift looks happy, thanks :) great, thanks for the update!
[17:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:03] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall)
[17:41:24] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall)
[17:47:46] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) One more data point: note that `gnt-instance console FQDN` is broken because of T309724 so we don't know the exact failure.
[17:54:00] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is c...
[18:04:09] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10ssingh) >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots...
[18:47:17] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10BCornwall) Thanks for the response, @Volans >>! In T357449#9539221, @Volans wrote: > The cookbook doesn't reboot the host once in the Debian Installer, i...
[19:35:59] 10Mail, 10Infrastructure-Foundations, 10Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (10Urbanecm) >>! In T338032#9530131, @Dzahn wrote: >>>! In T338032#9530072, @RoySmith wrote: >> Is there some way I can track those ze...
[21:24:47] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:56:38] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) I think there is some confusion, let me clarify some things: 1) BusyBox is the environment available during debian installer. That's totally norma...
[21:57:25] 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, 10GitLab (Auth & Access), and 2 others: Add GitLab to offboarding workflow - https://phabricator.wikimedia.org/T339843 (10brennen)
[22:57:40] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) >>! In T357449#9540286, @Volans wrote: > 2) If you run `sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet` it's very easy to see the co...
[23:04:30] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) > just a normal d-i partman configuration issue The code change from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674/3/modules/profil...
[23:10:32] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Volans) @Dzahn you can get a working console either setting the known hosts files to /dev/null and the strict checking to no in the ssh command running it...
[23:44:25] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @Volans Ah, yes, i can get a console when running `sudo /usr/lib/ganeti/tools/kvm-console-wrapper /usr/bin/socat ncmonitor1001.eqiad.wmnet /var/run/...
[23:48:10] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (10Dzahn) @BCornwall .. but then after installing the base system it fails at installing grub in /dev/sda.. which is not expected.
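The console workaround Volans describes at 23:10 (known hosts file to /dev/null, strict checking off) corresponds to two standard OpenSSH options. A dry-run sketch; the target host comes from the log, the exact invocation the console wrapper builds is not shown there, and the command is echoed rather than executed:

```shell
# Dry-run sketch of the host-key workaround from the log: standard
# OpenSSH flags that ignore known_hosts and skip strict checking, so
# the console's ssh hop doesn't fail on a reinstalled VM's changed key.
echo ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
  ncmonitor1001.eqiad.wmnet
```

This trades away host-key verification for that one connection, which is acceptable for a serial console hop to a VM that was just reimaged.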