[08:44:58] I'll be enabling blackbox monitoring for IDP.
[08:51:45] ack!
[09:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:28:17] I have to run a homer commit on cr*eqiad* for a wikikube worker rename/reimage. Besides the description change for the worker, I have 10 neighbor removals in another group and a lot of orange lines with "! neighbor {... }".
[12:28:17] Can anyone with more homer experience give me a second pair of eyes?
[12:28:17] I have the diff open in a tmux session on cumin1002, but I can also abort if needed
[12:46:09] the orange lines should be ok afair, the deletions are surprising though, but look like https://phabricator.wikimedia.org/T381175#10368381
[12:48:36] jelto: apologies, that's on me
[12:48:42] abort that please
[12:49:06] claime: yep, that's the issue - I forgot on Friday to import all the ganeti VMs to sort it out
[12:49:08] doing that now
[12:49:28] claime: good catch, thanks!
[12:49:28] I aborted the homer run
[12:52:12] (it's still aborting)
[12:52:35] jelto: ok, you can run again immediately afterwards, it should be ok now
[12:53:16] some small good news is that my patch for homer should speed up execution a bit once I get it merged
[12:53:29] great, thanks for the quick fix. I'll try again when homer allows me to :)
[12:53:46] ok, ping me with the diff and I'll take a look
[13:01:52] topranks: diff looks better now, the description for the wikikube worker changed but no other new ganeti VMs in the diff. Do you want to double-check?
The orange neighbor lines are still in the diff though
[13:02:25] sounds ok - if you can put it in a paste I'll double-check
[13:02:43] that sounds the same as what I saw a short time ago checking my own code
[13:02:59] surprised by the orange lines (re-ordered elements) but they should be fine
[13:05:35] topranks: https://paste.debian.net/hidden/77d34d2d/ here is the diff
[13:06:07] jelto: yep, that's fine, please proceed
[13:06:28] ok, proceeding
[13:07:01] I'm scratching my head at the IP re-ordering... but at least I now realise that when I saw it, it wasn't due to my replacement homer code (I'd been testing that and blaming myself for it)
[13:07:27] sorry for the confusion on that, I'll get that merged this week to prevent this happening again
[13:07:54] np, thanks for the quick help :)
[13:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:29:38] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: Mailserver refusing emails sent through VRTS due to too large headers - https://phabricator.wikimedia.org/T380696#10371902 (10jhathaway) p:05Triage→03Low
[16:52:09] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 10vrts: Mailserver refusing emails sent through VRTS due to too large headers - https://phabricator.wikimedia.org/T380696#10372289 (10jhathaway) I notified their postmaster, I'll update if I receive a reply.
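For anyone eyeballing similar homer diffs, a rough way to separate the surprising deletions from the mostly-cosmetic orange lines could be sketched like this. This is an illustrative helper, not part of homer itself; the "-" and "!" line prefixes are assumptions based on the diff style described in the conversation above.

```python
def classify_diff(diff_text):
    """Split a config diff into deletions (lines starting with '-')
    and in-place changes (lines starting with '!', which homer renders
    in orange and which are often just re-ordered elements).

    Sketch only -- the prefix conventions are assumed from the chat,
    not taken from homer's source.
    """
    deletions, changes = [], []
    for line in diff_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            deletions.append(stripped)
        elif stripped.startswith("!"):
            changes.append(stripped)
    return deletions, changes


sample = """\
- neighbor 10.0.0.1;
! neighbor {... }
- neighbor 10.0.0.2;
"""
dels, chgs = classify_diff(sample)
print(len(dels), len(chgs))  # prints: 2 1
```

Unexpected deletions (like the ganeti neighbor removals above) would then stand out separately from the re-ordered orange lines.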
[17:19:25] 07Puppet, 06cloud-services-team, 10Tools: Too many puppet facts on toolforge k8s workers - https://phabricator.wikimedia.org/T381293 (10Andrew) 03NEW
[17:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:23:25] 07Puppet, 06cloud-services-team, 10Toolforge: Too many puppet facts on toolforge k8s workers - https://phabricator.wikimedia.org/T381293#10372452 (10taavi)
[18:14:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[18:24:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[21:22:48] FIRING: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:35:06] topranks: it should be easy (in theory) to add lowest QoS for all traffic towards HDFS, right?
[22:35:33] cdanis: yep, I was gonna pick your brains on that, in terms of whether it was a good idea or not
[22:35:57] to me it would make sense to make these jobs less important than our other traffic
[22:36:27] absolutely
[22:36:28] should be a few simple puppet patches if we know the type of hosts involved on either side, and say port numbers or another way to base a rule on
[22:37:11] I have to go now but I tagged you in the Slack thread where we were also discussing
[22:37:15] what will also help are the upgrades in eqiad, but it'll be this time next year before we've done A-D and all servers moved
[22:37:18] ok, thanks
[22:37:40] the advantage of the newer switches is more bandwidth, and internal traffic doesn't route through the CRs
[22:37:46] but the QoS would make sense regardless, I think
[22:37:49] RESOLVED: PuppetFailure: Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:38:33] topranks: in this case, the readers (an-presto*) were mostly on upgraded rows :D
[22:38:52] it made things worse, I think, since the talkers are numerous (an-worker*) and all over eqiad
[22:38:54] yeah, we've seen a few instances of this
[22:39:21] the fact they're in the higher-bandwidth racks means they can send/receive more
[22:39:33] yeah, makes sense
[23:05:53] 10netops, 06Infrastructure-Foundations, 06SRE, 06Traffic: Heavy usage (possible scraping) of ceb.wikipedia.org from AS54801 (Dec 2 2024) - https://phabricator.wikimedia.org/T381347 (10cmooney) 03NEW p:05Triage→03High
[23:06:28] 10netops, 06Infrastructure-Foundations, 06SRE, 06Traffic: Heavy usage (possible scraping) of ceb.wikipedia.org from AS54801 (Dec 2 2024) - https://phabricator.wikimedia.org/T381347#10374057 (10cmooney)
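A port-based rule like the one discussed could look roughly like this Junos-style fragment. Everything here is hypothetical: the filter and class names are made up, and HDFS DataNode transfer is commonly port 9866 (50010 on older Hadoop releases), so the real deployment's ports and class names would need to be checked.

```
/* Hypothetical sketch, not a deployed config: classify bulk HDFS
   transfers into a low-priority forwarding class. */
firewall {
    family inet {
        filter classify-hdfs-bulk {
            term hdfs-transfer {
                from {
                    destination-port 9866;
                }
                then {
                    forwarding-class best-effort-low;
                    accept;
                }
            }
            term default {
                then accept;
            }
        }
    }
}
```

In practice this would land as a few puppet patches templating such a filter for the relevant host groups (an-worker* talkers, an-presto* readers), as suggested in the chat.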