[00:03:46] (SystemdUnitFailed) firing: (13) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:40] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (Dzahn) ooh, ok! thanks
[02:17:14] (SystemdUnitFailed) firing: (8) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:47:14] (SystemdUnitFailed) firing: (8) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:15] (SystemdUnitFailed) firing: (7) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:33:46] (SystemdUnitFailed) firing: (7) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:31:04] I'll create cumin1002 as a VM, we had been planning to retire the current hardware in favour of a VM anyway and this allows us to have a second cumin host using Puppet 7 in eqiad in parallel to cumin1001 (which can eventually be decommed)
[07:31:21] but currently DBAs still need a Cumin host running on Puppet 5
[07:33:47] (SystemdUnitFailed) firing: (5) geoipupdate.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:04] I restarted "geoipupdate.service" ^
[07:45:13] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (ayounsi) Another trigger, less likely, is if someone deletes a cable connected to a configured interface. Then automation will want to remove the description...
[07:47:16] (SystemdUnitFailed) firing: (5) geoipupdate.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:55] uh, it shows as healthy on the host
[07:49:47] restarted on the other hosts
[07:52:16] (SystemdUnitFailed) firing: (5) geoipupdate.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:16] (SystemdUnitFailed) resolved: geoipupdate.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:00] inflatador: ack, I'll comment in the o11y channel
[09:11:11] XioNoX: thanks for restarting them, are all ok now?
[09:11:48] volans: those at least yeah, is there a tracking task for the geoip issue overall?
[09:12:14] moritzm: ack for building cumin1002, we just need to double check/be careful with the setup for the "authoritative" repositories present there like homer private and reposync ones. I don't recall off the top of my head if all of them support 3 hosts out of the box.
[09:12:57] XioNoX: I'll send you a slack link
[09:13:12] volans: why not a task?
[09:13:19] don't ask me :)
[09:13:24] slack is a weird place to track such issues
[09:13:51] see also the email to sre
[09:16:52] I saw it
[09:16:57] thx :)
[09:17:12] still not a task :)
[09:18:16] ask the owner of the process ;)
[09:24:05] volans: I think reposync is fine, but the homer class only allows to configure one $private_git_peer
[09:47:03] I've added a new profile option to disable homer for a new setup, then we can first install cumin1002 as a cumin host and then at a later point migrate the homer repo from cumin1001:cumin2002 to cumin1002:cumin2002 : https://gerrit.wikimedia.org/r/c/operations/puppet/+/983144/
[09:48:51] I can have a look, not sure if it would be easy to just make it support multiple hosts
[09:50:03] moritzm: couldn't we use the puppet's ensure param for this?
[09:50:45] wrt: multiple hosts: I don't think it's really needed, a pair of two seems fine in general and for a transition to a new host we have the new profile variable
[09:51:09] k
[09:51:30] wrt ensure: not all resources managed by homer are currently ensurable and given this is used for transition away from an old host doesn't seem really needed?
[09:51:46] I mean when the homer repo is moved over, the host would otherwise be decommed
[09:52:16] only that here in this specific case it would be kept for some DBA tasks, but it's also quick to just clean up manually a few files
[09:54:06] yeah it was mostly for the timer for the daily check or similar things that will still be around
[09:54:38] in particular if the vm has been already installed
[09:54:53] and so homer profile was installed
[09:55:58] I could tweak the logic for $check_homer_diff_ensure to also factor in $disable_homer
[09:56:33] that would be nice I guess
[10:41:14] topranks, XioNoX: any of you working on cr1-eqiad/codfw in netbox? there are pending changes
[10:41:25] volans: ah, where?
[10:41:37] icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1002
[10:41:40] did you add manual config for the transport link?
[10:41:48] et-1-0-2.cr1-codfw, et-1-1-2.cr1-eqiad
[10:41:50] ah right, dns...
[10:41:50] etc...
[10:41:56] it's always dns...
[10:41:59] lol
[10:42:24] I'm the only one that has that alert highlight in my irc client? :D
[10:42:45] 100% sure, yes
[10:42:49] lol
[10:42:57] it was a hint ;)
[10:44:12] what do you mean?
[10:44:13] :)
[10:45:13] to add it to yourself :-P
[10:48:17] nothing to see here folks... move along
[11:17:16] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:47:16] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:59] moritzm: for wikitech once cumin1002 is ready to be used it should be as easy as updating https://wikitech.wikimedia.org/wiki/Template:CuminHosts (in theory :D )
[12:14:21] oh, nice
[12:14:41] volans: one small followup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/983166/
[12:15:11] doh, done
[12:17:03] thx
[12:24:50] volans: and one more: https://gerrit.wikimedia.org/r/c/operations/puppet/+/983174/
[12:26:38] moritzm: done, yeah makes sense
[12:26:44] thx
[12:38:36] volans: and one more... https://gerrit.wikimedia.org/r/c/operations/puppet/+/983177/
[13:04:17] moritzm: oh, ok
[13:04:18] done
[13:05:43] thx
[13:10:51] volans: it's a gift that keeps on giving :-) https://gerrit.wikimedia.org/r/983182 should hopefully be the last one
[13:21:53] lol
[13:22:03] git squash them all for next time moritzm
[13:23:00] (PuppetFailure) firing: Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:27:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:30:40] volans: homer issues are resolved, but there's still https://gerrit.wikimedia.org/r/983185 :-)
[13:31:25] told ya it was needed :D
[13:49:59] (PuppetZeroResources) firing: Puppet has failed generate resources on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:49:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:19:59] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:27:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:38:17] Puppet, Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (Jdlrobson) Open→Resolved a:Jdlrobson Thank you for helping me with this @colewhite! I can confirm the drop today!
[20:20:14] (PuppetFailure) firing: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:03:00] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:07:59] (PuppetFailure) firing: (2) Puppet has failed on netflow1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:41:13] (DiskSpace) firing: Disk space idp1002:9100:/ 5.967% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace