[07:54:06] netops, Infrastructure-Foundations: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (ayounsi) p: Triage→High
[08:23:24] Hey, can I ask a question about rsync::server::module please? swift::ring_manager uses it, passing $ensure (cf https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/swift/manifests/ring_manager.pp#70 ). But every swift frontend that _isn't_ the ring_manager (e.g. ms-fe1010) is still getting the rsync server installed, with no rsync.conf; that means that puppet tries to start the rsync server
[08:23:25] on every puppet run. It looks like maybe rsync::server::module isn't passing $ensure on to rsync::server (maybe because the latter has $ensure_service instead?)? But I'm probably Doing It Wrong...
[08:30:27] netops, Infrastructure-Foundations, SRE: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (ayounsi) > Kindly be informed that we have logged your issue under ref 01420952, we will investigate and get back to you with our findings.
[08:34:58] Emperor: I'm no expert, BUT
[08:35:40] https://github.com/wikimedia/puppet/blob/production/modules/rsync/manifests/server/module.pp#L59
[08:35:40] and
[08:35:40] https://github.com/wikimedia/puppet/blob/production/modules/rsync/manifests/server.pp#L22
[08:36:06] so it's what you're saying
[08:38:37] so the fix might be to replace `include ::rsync::server` with: https://www.irccloud.com/pastebin/tdUBKYA0/
[08:39:29] where xxx is something that converts $ensure to $ensure_service (especially for present vs. running)
[08:41:06] I'm not sure if that wouldn't end up potentially declaring the rsync::server class more than once?
[08:43:17] no idea :)
[08:50:55] XioNoX, jbond, topranks: I'd like to test the ganeti group support on netbox-next if that's possible, but that would require restoring it to a clean copy of production. I don't know if there is anything currently WIP there.
[08:51:09] both in terms of code and DB
[08:51:43] volans: nothing my end
[08:53:15] volans: yeah all good on my side, fire away
[08:53:16] volans: go for it!
[08:53:28] thanks, will do shortly
[10:05:36] netbox, Infrastructure-Foundations: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) p: Triage→Unbreak!
[10:06:22] jbond, XioNoX: FYI I've created ^^^, IMHO as unbreak, but feel free to downgrade it if you feel it's ok
[10:06:42] link?
[10:06:52] got it
[10:07:23] netbox, Infrastructure-Foundations: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans)
[10:08:31] volans: "That I guess is run by prometheus at an unnecessary frequency, we could run it once a day." -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/806422
[10:09:10] we should instead tune https://github.com/wikimedia/operations-software-netbox-extras/blob/master/tools/custom_script_proxy.py
[10:09:35] XioNoX: it's still 720 times more than needed every day :D
[10:09:53] volans: I know, but see the history, running it once a day is not possible
[10:10:02] tune in the sense of caching it?
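A minimal sketch of the caching idea raised just above: let the proxy reuse the last result for a while instead of re-running the Netbox custom script (and creating a new JobResult) on every Prometheus scrape. The names below, run_custom_script and CACHE_TTL, are illustrative assumptions and not the actual custom_script_proxy.py code.

```python
import time

CACHE_TTL = 300  # seconds; refresh the result at most every 5 minutes
_cache = {"ts": 0.0, "payload": None}


def get_metrics(run_custom_script):
    """Return the metrics payload, re-running the script only when the cache is stale."""
    now = time.time()
    if _cache["payload"] is None or now - _cache["ts"] > CACHE_TTL:
        # Expensive path: runs the Netbox custom script and creates a new JobResult.
        _cache["payload"] = run_custom_script()
        _cache["ts"] = now
    return _cache["payload"]
```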
[10:11:06] we could decouple the proxy and the "running of the script"
[10:11:16] have a cron that runs the script once a day
[10:11:33] and the proxy that only takes care of displaying the result
[10:13:41] volans: we can also set this https://docs.netbox.dev/en/stable/configuration/dynamic-settings/#jobresult_retention
[10:13:50] to something not high
[10:14:57] we went into production less than a week ago
[10:15:15] doesn't seem like a "long" period
[10:15:36] true
[10:15:51] i think we should instead look at changing the retention policy for the report results
[10:16:09] disagree, that's not the problem IMHO
[10:16:11] prometheus should be able to query every second but we don't need to keep a report for every second
[10:16:34] having some history can be useful though
[10:16:53] I'd like to keep the history for normal scripts and reports
[10:16:53] well, for right now we can also disable it altogether, it's not a critical feature
[10:16:59] and that history will be graphed in prometheus
[10:17:15] sure but ideally we should be able to configure it on a per report/script basis
[10:17:16] SRE-tools, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (JMeybohm)
[10:17:30] we can't configure the retention per script though
[10:17:40] it's a global config AFAICT
[10:17:49] but we can investigate how to do that
[10:18:22] even 1 day would be too much IMHO for something that runs 720 times a day at minimum
[10:18:39] so we have multiple ways of tackling this problem
[10:18:41] (or is that multiplied by N prometheus scrapers?)
[10:18:43] volans: for the extras report i would have it with 0 retention if possible
[10:18:55] jbond: "Set this to 0 to retain changes in the database indefinitely."
[10:19:02] so not an option apparently
[10:19:08] i'm sure you know what i meant
[10:19:13] yes
[10:19:18] but I don't think netbox allows it
[10:19:50] I think we should just a) not run the script often, just read the last result to make prometheus happy
[10:19:58] we could set it to 1 briefly to clean up the DB once we have the final fix
[10:20:22] XioNoX: nah we can delete them manually only for this script
[10:20:27] and keep the others' history
[10:20:28] sounds good
[10:20:57] b) we could evaluate longer term to migrate this script to a plugin (or see if anyone did something similar already) to have it as a dedicated API endpoint
[10:21:18] volans: https://github.com/networktocode/ntc-netbox-plugin-metrics-ext :)
[10:22:38] XioNoX: that looks nice, i like the fact that we can also add our own hooks for metrics
[10:24:10] volans, jbond, at that rate, should we temporarily disable the prometheus scraping?
[10:24:34] XioNoX: sgtm
[10:25:12] I'll have a quick look at that plugin
[10:25:19] see if it's compatible as-is
[10:25:25] ack sgtm
[10:25:49] godog, what's the cleanest way to temporarily disable a prometheus job?
[10:26:01] cf https://phabricator.wikimedia.org/T311048
[10:27:10] once disabled I can take care of deleting the results via django APIs
[10:27:21] (nbshell)
[10:28:24] XioNoX: comment the job out of the prometheus config
[10:33:27] volans: good catch btw :)
[10:34:03] what I don't get is the discrepancy between the postgres table size and the uncompressed backup size...
[10:34:21] but I didn't investigate too much, I'd say let's fix this, clean up and then check it after
[10:34:26] if there is anything else
[10:34:39] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (jbond) somewhat related, looks like we should maybe run the following [[ https://github.com/netbox-community/netbox/blob/master/docs/administration/housekeeping.md |...
[10:34:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/807091
[10:35:11] godog: ^
[10:37:02] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) >>! In T311048#8016198, @jbond wrote: > somewhat related, looks like we should maybe run the following [[ https://github.com/netbox-community/netbox/blob/mast...
[10:40:05] thanks!
[10:40:06] merged and running puppet on prometheus1005
[10:40:31] if no errors there I have to go to lunch
[10:41:28] XioNoX: thanks, lmk when done
[10:41:34] I can check it's not polling anymore
[10:41:37] and then clean up
[10:43:23] all done on prometheus1005
[10:43:59] how many prometheus hosts scrape this?
[10:44:11] two in eqiad and two in codfw
[10:44:31] site-local that is
[10:44:43] so netbox in eqiad gets scraped 2x, ditto for codfw
[10:45:04] I still got 4 calls for that script at minute :44
[10:46:19] mmhh 4x from prometheus for a single host ?
[10:47:18] I'm just saying I'm seeing 4 calls/minute, didn't check the callers' IPs (yet)
[10:47:27] and they are still coming
[10:48:17] ok, checking
[10:48:58] 2620:0:861:102:10:64:16:62, 2620:0:861:101:10:64:0:82 seem to be the callers
[10:49:29] so prometheus100[5-6]
[10:51:19] and yes it's a 2-minute frequency for each
[10:51:25] ack
[10:52:01] 1005 might have stopped, 1006 seems to keep going
[10:52:03] did you change anything?
[10:52:14] do we just force a puppet run on 1006 too?
[10:52:42] yeah I issued a reload on prometheus@ops on 1005, I am running puppet now on 1006, the reload should happen
[10:53:28] standing by for puppet to complete
[10:54:19] Notice: /Stage[main]/Profile::Prometheus::Ops/Prometheus::Server[ops]/Exec[prometheus@ops-reload]: Triggered 'refresh' from 1 event
[10:54:31] the expected run at 54:11 didn't happen
[10:54:44] so it might have correctly stopped :)
[10:55:25] yeah not sure tbh why the puppet-issued reload didn't work on 1005, but it looks like it did on 1006
[10:56:53] ack, thanks godog
[11:00:32] volans: can we do something like this https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/807095
[11:00:49] * volans deleting the old job results
[11:00:53] sure
[11:00:53] (see SAL)
[11:01:06] * jbond can definitely wait
[11:02:47] I'm not sure if that might create issues for two main reasons:
[11:03:34] 1) deleting the same kind of object while the script is running might cause issues in some netbox logic (current or future)
[11:04:07] 2) potential race conditions if 2 calls for the same script come around the same time
[11:05:15] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) I've opted to run: ` >>> jobs = JobResult.objects.filter(name='getstats.GetDeviceStats') >>> for job in jobs: ... job.delete() ... ` Instead of the more c...
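A more defensive variant of the nbshell cleanup shown in the comment above, addressing the race-condition worry: it only touches finished results of this one script and leaves anything recent alone. The field and status names assume the NetBox 3.x JobResult model and should be checked against the deployed version before running.

```python
# Run inside `nbshell` on the Netbox host.
from datetime import timedelta

from django.utils import timezone
from extras.models import JobResult

# Only completed results of this one script, and nothing from the last few
# minutes, so a scrape that is still in flight is never deleted.
cutoff = timezone.now() - timedelta(minutes=5)
stale = JobResult.objects.filter(
    name='getstats.GetDeviceStats',
    status='completed',          # assumes the 'completed' status choice value
    completed__lt=cutoff,
)
print(stale.count())  # sanity check before deleting
stale.delete()        # bulk delete; note this skips per-object delete() hooks
```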
[11:06:38] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) I've also run the `housekeeping` job: ` # python manage.py housekeeping [*] Clearing expired authentication sessions Sessions cleared. [*] Checking for expir...
[11:07:47] jbond: I did run the housekeeping job too, but the amount of changelog it deleted is a bit worrying
[11:07:51] did the setting change?
[11:08:18] oldest changelog is now from 2022-03-23
[11:08:33] we should have 720 days of changelog :/
[11:09:42] tbh i think that using a plugin is the better way to go, however in relation to 2. i think that confining the script to the job status so that it only deletes completed reports should be enough. however we can also update the sql query to delete all but the last N reports or completed reports older than ~5 minutes (or something else). in relation to 1. i think it should be possible to at
[11:09:48] least get this working with the current version of ...
[11:09:51] ... netbox as it is based on something that was removed in 3.2 so i think it's safe for now. and as long as it's confined to this script then i'm not sure it adds much more in future upgrades as we already have to test everything. however as said, my vote would be to add something now to buy us time to move to using some plugin
[11:10:14] volans: i'm not sure about the settings but the default for CHANGELOG_RETENTION is also 90 days
[11:12:00] i don't see CHANGELOG_RETENTION in configuration.py so i assume it's using the default
[11:12:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/790681
[11:12:53] it was deleted, why?!?!?!
[11:12:57] now I have to restore it from the backups
[11:13:24] i suspect it was not intentional, i'll create a CR to add it back
[11:13:51] also PREFER_IPV4 = False
[11:13:53] has been deleted
[11:13:54] why?
[11:14:32] it's in the dynamic settings
[11:14:44] https://netbox.wikimedia.org/admin/extras/configrevision/add/
[11:15:16] but was not set
[11:15:25] ack, so looks like we missed a step to configure these via the gui when we upgraded?
[11:15:28] like the others
[11:15:35] that were moved
[11:15:53] "These configuration parameters are primarily controlled via NetBox's admin interface (under Admin > Extras > Configuration Revisions). These settings may also be overridden in configuration.py; this will prevent them from being modified via the UI."
[11:15:58] did we choose to make them dynamic?
[11:16:07] it seems they could have lived in the static config
[11:16:19] seems much safer IMHO
[11:16:27] and tracked/auditable
[11:23:54] volans: looks like we missed that sentence
[11:24:15] * jbond drafting something to add them back
[11:25:15] I'll look at restoring them in the afternoon after lunch, at least it's stale data so it shouldn't be that much of a problem (hoping that netbox/django/postgres don't re-use the same unique identifiers in the meanwhile)
[11:25:44] * volans has to go for lunch now, bbiab
[11:29:07] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) p: Unbreak!→High >>! In T311048#8016250, @Volans wrote: > ` > [*] Checking for expired changelog records > Deleting 54103 expired records... Done. > `...
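For reference, restoring the two settings statically in Netbox's configuration.py (the change jbond mentions drafting) could look roughly like this; the 720-day value comes from the discussion above, and defining them in the static config also prevents them from being edited via the dynamic-settings UI.

```python
# Possible additions to configuration.py (deployed via the puppet template);
# Netbox's built-in default for CHANGELOG_RETENTION is 90 days.
CHANGELOG_RETENTION = 720  # days of object change history to keep
PREFER_IPV4 = False        # keep preferring IPv6 when picking primary device IPs
```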
[11:37:47] SRE-tools, Infrastructure-Foundations, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (jbond)
[11:50:12] jbond: https://netbox-next.wikimedia.org/api/plugins/metrics-ext/app-metrics
[11:50:43] it's a local change so puppet might roll back the config file at any time
[11:52:17] https://phabricator.wikimedia.org/P29937
[12:00:21] XioNoX: looks good but not a 1-to-1 replacement for the current getDevices endpoint. not sure if that's an issue
[12:00:30] anyway time for me to go for lunch now :)
[12:00:52] jbond: yeah, your patch should solve the most urgent issue
[12:01:13] I'm opening a task for the follow-up actions
[12:18:23] sgtm
[12:23:52] netbox, Infrastructure-Foundations: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052 (ayounsi)
[12:29:15] netbox, Infrastructure-Foundations: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052 (ayounsi) p: Triage→Low
[13:37:41] I made T311066 to track the puppet/rsync issue for now
[13:37:42] T311066: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066
[13:44:35] Emperor: if that's the only thing using rsync on the swift frontends, you could explicitly declare `class { 'rsync::server': }` and set the required args there
[13:46:15] that definitely won't trip me up in future ;) but might be a sensible workaround
[13:47:55] (sort-of feels like the wrongish shape of answer though)
[13:56:10] jbond: do you recall why .gitmodules is modified on netbox-dev2002?
[13:56:17] in /srv/deployment/netbox/deploy
[14:12:00] also, do you know why postgres is connecting to itself?
[14:12:02] postgres: netbox netbox 10.192.48.191(53574) idle
[14:12:14] this prevents dropping the db to restore the backup on netbox-next
[14:18:48] https://wikitech.wikimedia.org/wiki/Netbox#Flush_caches_after_a_restore is not valid anymore
[14:20:27] and now got password authentication failed for user "netbox"
[14:23:47] had to manually change it from postgres to the value in the config as prod and dev now have different ones (should be added to the docs)
[14:57:56] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) I've restored yesterday's DB backup to netbox-dev2002 (netbox-next), deleted all the changelog existing in current netbox production: ` netbox=# delete from e...
[15:01:38] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) We're back to a DB dump of ~78MB (~8.5MB compressed).
[15:06:47] XioNoX: apologies if this is a dumb question, but do you know why I was added to https://phabricator.wikimedia.org/T311039 ? Is my team responsible for this Maps application?
[15:07:37] inflatador: Gehel's team was responsible at some point, dunno where it's at now
[15:07:59] OK, I'll ask him. Doesn't look like us but I'll verify
[15:08:37] thanks! don't hesitate to re-shuffle the people tagged :)
[15:10:07] XioNoX: it's core platform these days
[15:10:42] you can add Hugh and msantos
[15:13:55] ACK, added them to the ticket
[15:22:58] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (jbond) >>!
In T311048#8016284, @Volans wrote: > This should not have happened, the CHANGELOG_RETENTION setting was removed in https://gerrit.wikimedia.org/r/c/operat...
[15:23:36] volans: i think the gitmodules thing was related to us switching between using tags and branches for deploy. AFAIK it can be reverted
[15:23:55] I didn't touch that so far, just FYI
[15:24:09] the rest I reverted to be clean like prod AFAICT
[15:24:10] ack, i'm not sure about the postgres connections
[15:24:34] ack sgtm
[15:25:11] I did end up killing them (it was 1 or 2) and then had to restart uwsgi to make netbox-next work too
[15:29:28] hmm weird
[15:34:15] netops, Infrastructure-Foundations, SRE: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (ayounsi) Open→Resolved a: ayounsi > This should be fixed. Looks like it was a configuration failure during the planned migration PWIC218882.3. Confirmed resolved.
[15:35:16] XioNoX, jbond: latest run of the netbox ganeti group sync done, see my last comment on https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/802179/
[15:38:35] volans: lgtm
[15:39:07] great, thx, this is what I wanted to do this morning... and how I found out about the db growth :D
[15:48:55] I'll probably delay the merge to tomorrow at this point
[15:57:01] ack thanks
[16:07:47] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) >>! In T311048#8017056, @jbond wrote: >>>! In T311048#8016284, @Volans wrote: > >> This should not have happened, the CHANGELOG_RETENTION setting was removed...
[16:08:07] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 (Volans) Next step is to set it as a timer to run daily or so on the primary netbox host.
[22:07:45] SRE-tools, Infrastructure-Foundations, SRE, Traffic-Icebox, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall) a: BCornwall
[22:07:58] SRE-tools, Infrastructure-Foundations, SRE, Traffic, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall)