[08:09:24] XioNoX, jbond: are you still planning to do the netbox upgrade today? Did I miss the announcement to ops? I think we should give some advance notice of the upgrade.
[08:17:45] volans: we won't do it today, and yes we're going to send advance notice to SREs
[08:18:25] ack, thx :)
[08:46:47] jbond, volans, any idea why in -dcops the netbox alerts only show the "problem" but never the recovery?
[08:47:01] or should I ask observability?
[08:47:10] is it flapping in icinga by any chance?
[08:47:40] if it flaps between critical and unknown/warning that would explain it
[08:48:12] yep
[08:48:13] CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[08:48:36] I'll have a look, thx for the pointer
[08:59:53] Good morning team. I'm back.
[09:00:13] welcome back!
[09:01:55] welcome back jobo! :)
[09:02:49] Have I missed much? :)
[09:02:59] you just have 99.999 unread emails
[09:03:04] :-P
[09:04:09] :D quite close
[09:04:36] welcome back!
[09:06:10] welcome back jobo
[09:32:15] volans, jbond of course I can't reproduce with `alert1001:~$ /usr/lib/nagios/plugins/check_nrpe -2 -u -H 10.64.0.186 -c check_check_netbox_accounting -t 10` even running a lot of them
[09:33:21] however it seems to fail every 4h, around the 33min mark
[09:33:44] what does the check do?
[09:33:53] netbox removes the results when the report is running
[09:34:01] and until the new report is there there are no results
[09:35:04] https://github.com/wikimedia/puppet/blob/production/modules/icinga/files/check_netbox_report.py that's the script
[09:35:16] command[check_check_netbox_coherence]=/usr/bin/python3 /usr/local/lib/nagios/plugins/check_netbox_report.py coherence.Coherenc
[09:35:19] e
[09:38:47] XioNoX: if it flaps to unknown I'd say it's that
[09:38:57] just modify it to retry a few times and it should be ok
[09:39:04] doesn't look like it so far
[09:39:10] for example the network report is running
[09:39:18] and still returns
[09:39:21] OK
[09:40:01] XioNoX: ah..
no it's different
[09:40:11] see https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=netbox1002&service=Netbox+report+accounting
[09:40:28] it makes nrpe timeout
[09:40:32] so it's taking too long
[09:40:51] but it's instant when I'm running it manually
[09:41:20] even if it's running? dunno, maybe there is something that makes netbox slow around the 33min mark?
[09:42:41] yeah even if it's running
[09:44:09] XioNoX: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=netbox1002&var-datasource=thanos&var-cluster=misc&viewPanel=3
[09:44:23] it's also mostly coherence and accounting, which are not the slowest reports either
[09:45:09] oh look every hour at the 33min mark!
[09:45:11] :)
[09:45:47] also the accounting runs at :14 and :44 so it's not overlapping its runs, it's just netbox having issues
[09:46:31] oh look https://github.com/wikimedia/puppet/blob/64db7e9c7b2e797b9a73143e1f56ce54858bf589/hieradata/common/profile/netbox.yaml#L9
[09:46:36] profile::netbox::dump_interval: '*-*-* *:32:00'
[10:40:35] CFSSL-PKI, Infrastructure-Foundations, Prod-Kubernetes, serviceops, Kubernetes: Update cfssl-issuer ti cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (JMeybohm)
[10:48:29] joanna! welcome back :-)
[10:50:06] CFSSL-PKI, Infrastructure-Foundations, Prod-Kubernetes, serviceops, Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (JMeybohm)
[11:48:56] CFSSL-PKI, Infrastructure-Foundations, Prod-Kubernetes, serviceops, Kubernetes: Update cfssl-issuer to cert-manager 1.8.x - https://phabricator.wikimedia.org/T310486 (JMeybohm) p:Triage→Low
[12:15:13] wb jobo !
[12:58:18] 9vol
[12:59:04] volans: XioNoX: i think the issue with netbox could be that both netbox1001 and netbox1002 are trying to generate the report (both targeting netbox.discovery.wmnet)
[12:59:33] all timers are still active on netbox1001?
[12:59:36] can we stop them?
[13:00:00] yes we should be able to
[13:00:50] +1
[13:08:44] volans: XioNoX: https://gerrit.wikimedia.org/r/c/operations/puppet/+/805125
[13:45:49] was in a meeting, nice!
[14:30:32] netbox, Infrastructure-Foundations, Patch-For-Review: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446 (ayounsi) Not sure yet if it's a good idea, but as Netbox 3.2 allows for objects custom fields it's possible to add a link between a cluster and ro...
[14:31:24] netbox, Infrastructure-Foundations, Patch-For-Review: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446 (ayounsi)
[14:31:26] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi)
[14:33:07] netbox, Infrastructure-Foundations, Patch-For-Review: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446 (ayounsi)
[14:33:10] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi)
[14:33:19] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi)
[14:33:22] netbox, Infrastructure-Foundations, Patch-For-Review: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446 (ayounsi)
[16:12:02] Apologies for missing the meeting this morning, I am headed back from a wedding and my cell connection on the train was very spotty