[00:06:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:19] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10068961 (ayounsi) `name=Job dispatched to netbox1003 - takes less than 2min Aug 16 09:24:25 netbox1003 python[1079619]: 09:24:25 default: extras.scripts.run_script...
[09:55:52] elukey: I think I figured out the rq issue in Netbox: https://phabricator.wikimedia.org/T341843#10068961 :)
[10:09:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:06:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:36:52] jayme: o/ re: timeout, Papaul reported a similar issue, our suspicion is that partman for some reason does something like zero-ing all the disks behind the scenes. You don't have the logs about what partman/d-i was doing by any chance, right?
[12:40:48] XioNoX: o/ re rq issue, nice work! It seems similar to what we are experiencing with Debmonitor
[12:41:07] (codfw trying to commit to the eqiad db)
[12:45:02] elukey: it was the actual mkfs.ext4 call that took so long
[12:45:29] I suspected that was because of the parallel raid rebuild, but that does not seem super plausible
[12:46:25] and I also wonder why the cookbook actually failed. Reading the code seems to suggest that it should fall into an ask_confirmation - but it did not
[12:50:37] It is worth opening a task, indeed it would be nice to ask_confirmation. I suggest a task since you could paste the error that you got etc., so it will be easier for us to create a fix
[13:08:36] elukey: replied to your comments, we seem to have opposite ways of doing things :)
[13:09:07] elukey: if you have some time for an extra review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060075
[13:16:30] done!
[13:18:40] XioNoX: replied, my personal preference is not to use active_server inside the class
[13:20:08] elukey: not sure I understand the 2nd part of your comment
[13:32:22] XioNoX: I was trying to say that selecting what flags are on/off (for rq, other features in the future as you mentioned) seems to be something more belonging to the profile, where we can get the hiera value and decide what to do.
[13:39:10] so you suggest to change the current profile's $active_ensure one liner to a larger if/else that sets the $active_ensure but also something like $enable_rq_netbox ?
[13:40:20] elukey: also are you familiar with Redis? Ideally we wouldn't have to disable rq-netbox on the secondary node, but have as much active/active as possible
[13:43:10] XioNoX: I am familiar with Redis, I've read what you wrote in the task about the latency and indeed not having cross-dc calls
[13:44:19] elukey: so it's doable?
[13:44:21] XioNoX: re: active_ensure - we still don't have any idea if we'll need a larger if block or not, for the moment IIUC it is only rq that is different from active/standby, possibly changing in the future if we want active/active
[13:45:06] XioNoX: the third option is really difficult and I wouldn't suggest it, the one that you are proposing seems good, but it needs to be taken into consideration when failing over
[13:45:18] so more active/standby
[13:47:17] elukey: ah ok, so you suggest to pass the current $active_ensure from the profile to the netbox module as $rq_ensure, or something like that? Just want to be sure before I send a new PS
[13:50:14] exactly yes, at the moment it seems the quickest. If in the future we'll need more flags etc. then the if block might be an option
[13:50:39] alright, on it
[13:57:05] done :)
[14:04:13] <3
[14:05:59] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with netbox-more-metrics - https://phabricator.wikimedia.org/T311052#10069539 (ayounsi) Open→Resolved a:ayounsi All done.
[14:09:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:58] SRE-tools, Infrastructure-Foundations, Spicerack: sre.hosts.reimage failing due to mkfs.ext4 taking to long - https://phabricator.wikimedia.org/T372648 (JMeybohm) NEW
[14:40:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:59:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:49] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:49:30] netops, Infrastructure-Foundations: Netbox ProvisionServer script fails vlan verification - https://phabricator.wikimedia.org/T372654 (ayounsi) NEW p:Triage→High
[16:30:54] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10070003 (elukey) Currently blocked by T372485
[19:04:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed