[00:00:47] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:59:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:09:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:54:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:38] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:38] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:19] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[03:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:05] FIRING: [3x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [03:36:45] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:45:38] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:49:20] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:50:38] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:38] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:59:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:09] netbox, Infrastructure-Foundations: Netbox: manage VRRP priorities - https://phabricator.wikimedia.org/T319301#10001380 (ayounsi) > This doesn't actually seem to be the case? We have no priority set on either router. Indeed, not anymore since we moved to 100G between eqiad/codfw > Is this a good idea? I... [07:03:42] XioNoX: netbox[12]003 should probably have notifications disabled until they are stable enough not to spam so much here and in -operations ;) [07:12:22] volans: yep good idea [07:12:36] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10001398 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=92ae15a3-d066-4959-9504-9286a87c9cd2) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their services... [07:12:36] volans: I just downtimed it, we're going to upgrade NB in 50min [07:14:57] ack, fingers crossed [07:16:39] elukey, it's awesome to see the progress made on https://phabricator.wikimedia.org/T363576 !
[07:34:05] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:36:45] FIRING: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:39:37] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10001438 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f92060e-7cc5-42b2-b105-d3b395a0abd4) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their services... [07:43:53] Packaging, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2024/2025-Q1): upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#10001441 (Volans) I'm getting failures for this change on `db1179`: ` level=info msg="Starting ipmi_exporter" version... [07:49:20] FIRING: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:17] morning folks! [08:02:33] XioNoX: thanks! There is a ton of work ahead though, I am afraid [08:03:06] morning!
[08:03:22] still better than being stuck with no idea about what's going on [08:19:20] FIRING: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:29:14] XioNoX: for the homer release I am going to use python-release, since we already have the .wmfconfig file [08:29:34] all automated, but it will create another CHANGELOG code review etc.. [08:29:53] I'll copy what you added and then post the new one in here ok? [08:30:10] elukey: sounds good! [08:34:20] RESOLVED: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:57] XioNoX: since there are some breaking changes (like drop support for < 3.3 etc..), would it be ok to release 0.7.0 instead of 0.6.7? [08:38:05] elukey: sure [08:38:08] super [08:39:20] FIRING: SystemdUnitFailed: postgresql@15-main.service on netboxdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:51] https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1055877 [08:45:55] ready --^ [08:47:23] elukey: +1 [08:47:34] thanksss [08:48:00] Netbox 3.3 DB converted to 3.7 and loaded onto netboxdb1003 [08:48:52] lmk if you need help (you mentioned DB so I can speak) :-P [08:49:07] this was for jobo ^^^ :) [08:49:49] :D [08:50:32] homer 0.7.0 released! [08:50:44] thx! 
[08:51:07] https://pypi.org/project/homer/#history [08:51:22] \o/ [08:51:23] running the replica sync from https://wikitech.wikimedia.org/wiki/Postgres#Syncing_Postgres_replica [08:52:53] going along the steps of https://etherpad.wikimedia.org/p/netbox4-upgrade [08:55:24] ack! [08:55:31] I am going to ping the traffic team for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055187 [08:55:47] IIUC we'll need to restart pybal on some lvses [08:55:56] what's the cumin selector to target specifically netbox1003 and netbox2003 ? [08:56:15] mmm or not [08:56:45] XioNoX: you can use 'netbox1003*', it is the easiest that comes to mind [08:56:59] or do you mean both? [08:57:09] both yeah, but one after the other would works too [08:57:10] if so, 'netbox1003* or netbox2003*' [08:57:23] 'netbox[1-2]003*' [08:58:17] mine is more colorful :P [08:58:29] lol [08:58:34] thx, of course I tried `netbox*003` and `and` [08:58:53] ok no for netbox.discovery not sure what's needed, not a pybal restart [08:59:02] btw, don't forget to update the cumin aliases for netbox once migrated, it affects surely some cookbook like the dns one [08:59:20] RESOLVED: SystemdUnitFailed: postgresql@15-main.service on netboxdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:26] volans: it's point 7 on the runbook :) [08:59:30] great [09:00:11] elukey: I think those IPs go directly into ATS config, but my memory might fail me [09:00:21] alright, deploying the code to 1003/2003 [09:00:43] I'd be surprised if it worked on the first go [09:01:19] RESOLVED: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:01:26] volans: there is a discovery record for netbox, this is why I 
am puzzled, maybe it is just for completeness/alarming that we have it in puppet? [09:01:43] or not, netbox.discovery.wmnet is not a CNAME [09:01:51] interesting [09:03:50] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:05:00] getting 2 errors during the netbox migration script run: [09:05:00] ModuleNotFoundError: No module named 'psycopg2' [09:05:00] ImportError: no pq wrapper available. [09:08:54] namely https://phabricator.wikimedia.org/T336275#9913396 right? [09:10:08] elukey: more or less, yeah, when sre.deploy.python-code runs, it copies the code over, then runs https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-deploy/+/refs/heads/main/scap/checks/netbox_setup.sh [09:10:23] here it's the `${PYTHON} ${NETBOX_ROOT}/netbox/manage.py migrate` that fails [09:10:48] in the wheels there is: [09:10:51] psycopg==3.2.1 [09:10:51] psycopg-c==3.2.1 [09:10:51] psycopg-pool==3.2.2 [09:12:28] so not sure so far why it's not happy :) [09:12:39] ah ok so you are at step 5. 
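The two errors reported at 09:05:00 are consistent with Django first trying to import `psycopg` (version 3) and then falling back to `psycopg2`; `no pq wrapper available` is the message psycopg 3 raises when none of its libpq wrappers can be loaded. A minimal sketch for checking this on a host, assuming it is run with the same Python interpreter as the Netbox virtualenv (the helper name `probe_postgres_driver` is hypothetical, not part of the deploy tooling):

```python
import ctypes.util
import importlib


def probe_postgres_driver():
    """Report whether libpq is visible and which psycopg 3 wrapper loads.

    Hypothetical diagnostic helper, not part of netbox-deploy.
    """
    report = {
        # Name of the shared library (e.g. "libpq.so.5"), or None if not found.
        "libpq": ctypes.util.find_library("pq"),
        "psycopg_impl": None,
        "error": None,
    }
    try:
        # Importing psycopg.pq fails with ImportError when no wrapper
        # ('c', 'binary' or 'python') can be loaded.
        pq = importlib.import_module("psycopg.pq")
        report["psycopg_impl"] = pq.__impl__
    except ImportError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(probe_postgres_driver())
```

If `libpq` comes back as `None` and `error` is set, the system library is the likely gap rather than the Python wheels.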
[09:13:17] yep [09:13:41] full run: https://phabricator.wikimedia.org/P66881 [09:13:50] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:14:20] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:38] "PsyCop G2" [09:15:44] sounds like a sequel to an 80s action movie [09:16:30] FIRING: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:17:55] XioNoX: - couldn't import psycopg 'c' implementation: libpq.so.5: cannot open shared object file: No such file or directory [09:18:06] it's missing libpq that apparently is required [09:18:13] cool, yeah [09:18:28] looks like the psycopg2 is just one of the tries [09:18:48] with a lot of slowness but I wanted to say the same :D (netbox1003 is still pristine, the error comes from 2003 afaics) [09:18:49] but I don't see libpq5 installed on netbox hosts [09:18:55] just netbox-dev and netboxdbs [09:19:08] https://netboxlabs.com/docs/netbox/en/stable/installation/3-netbox/ mentions `libpq-dev` [09:19:10] maybe it was installed manually and not via puppet [09:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:38] it's probably installed automatically on -dev
as a postgres dependency [09:19:58] I'll install it manually and make a note to puppetize it [09:20:58] how does it worn now? [09:21:00] *work [09:21:07] volans: on prod? [09:21:10] yes [09:21:21] volans: maybe it's a new requirement for nb 4 [09:21:55] it is a reverse dep of postgres afaics [09:22:20] (the -dev package) [09:22:33] DB migration running [09:22:39] very good [09:22:55] need to run afk for ~30 mins, will be back asap [09:23:01] no pb! [09:24:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:03] new error, but doesn't seem blocking [09:26:08] https://www.irccloud.com/pastebin/xEtBCXP3/ [09:26:33] it's like the netbox user doesn't have rights to sudo run that command [09:26:40] I ran it manually and it went fine [09:26:57] yeah that all probably makes sense [09:27:12] if doing manually worked hopefully can just proceed [09:27:25] yep, it's running on 1003 now [09:29:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:30] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10001633 (ops-monitoring-bot) Deployed netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275 [09:30:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:30] RESOLVED: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:40:42] FIRING: [31x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:50] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:54:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:24] volans: I might need some help making sure the Netbox DNS DB is set fine on the new servers [09:56:38] DNS DB? [09:56:41] I think it might need to be rsynced from the current 1/2003 [09:56:42] that's new [09:56:48] you mean repo? [09:57:16] volans: yeah but you're in data persistence so I have to call it a DB :) [09:58:15] right, a DB with a file backend [09:58:16] correct [09:59:14] XioNoX: so *2 are the prod ones and *3 the new ones right? 
[09:59:20] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:21] volans: yep [09:59:32] I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055187 so netbox.wikimedia.org should point to the new servers soon [10:00:26] so [10:00:30] /srv/reposync/netbox-hiera seems to be already fine [10:00:33] at db0187ae51ff2a162bfa82dba2d586ddbf1736e9 [10:01:49] while /srv/netbox-exports/dns.git is not setup [10:02:46] I think I can just scp it? [10:02:56] (hence my suggestion to migrate to reposync ;) ) [10:03:02] no I'll just do it with git [10:03:19] is there a task? :) [10:03:50] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:04:00] but puppet should have init the repo [10:04:03] why it didn't? [10:04:19] $ sudo head -n1 /srv/netbox-exports/dns.git/config [10:04:20] # MANAGED BY PUPPET [10:04:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:21] on 1002 [10:04:37] ah no it did [10:06:05] back! 
[10:06:32] give me a sec [10:06:35] adding 1002 as remote [10:06:40] temporarily [10:07:27] elukey: I pinged traffic on why https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055187 isn't enough [10:08:20] XioNoX: it seems to be working, netbox.discovery.wmnet is updated [10:08:56] XioNoX: ok I had to do: git init --bare on /srv/netbox-exports/dns.git (as netbox user or you have to fix permissions later) [10:09:00] now I'm running [10:09:04] runuser -u netbox -- git -C "/srv/netbox-exports/dns.git" fetch netbox1002.eqiad.wmnet master:master [10:09:07] like the dns cookbook does [10:09:18] (I temporarily added netbox1002 as an additional remote) [10:09:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:22] now trying to do the same on 2003 but syncing from 1003 [10:10:24] to ensure it works [10:10:56] XioNoX: something missing in 1003 to export it [10:10:56] fatal: repository 'https://netbox1003.eqiad.wmnet/dns.git/' not found [10:11:06] let me run manually one thing [10:11:30] ok fixed [10:11:34] thx [10:11:45] I had to run manually sudo -u netbox git update-server-info, it's in the post-update hook [10:12:00] but the manual fetch probably didn't trigger it I guess [10:13:05] XioNoX: clone completed on 2003 too [10:13:22] awesome [10:13:31] hopefully that's the last time it will be needed? 
:) [10:13:34] removed temporary remote from 1003 [10:14:15] The following static media file failed to load: setmode.js [10:14:55] I clicked on click here to attempt to reload netbox again [10:14:56] and it worked [10:15:04] but still on (v3.2.9) [10:15:38] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:06] I'm on the 4 UI, but it's not prompting me for the idm logging screen, but for the regular one [10:19:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:25] (running puppet on dns nodes) [10:20:02] no IDM, confirmed [10:20:40] yeah, I missed some Netbox hiera changes [10:21:06] I thought current prod was migrated to new IDM python module [10:25:13] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055887 (PCC running) [10:25:27] not sure if I'm missing something else though [10:26:32] https://puppet-compiler.wmflabs.org/output/1055887/1490/netbox1003.eqiad.wmnet/index.html looks satisfying :) [10:29:24] looks good but I don't have a lot of experience with that stack, let's try [10:29:33] do we require any change on the bitu front? [10:29:41] like cleanups of old tickets etc.. 
[10:29:55] I don't think so, but maybe also allow the new hosts somewhere [10:29:58] I recall that we had to do something similar with Simon the last time [10:32:07] let's do it and see what breaks :D [10:32:12] yeah, looks like something's missing `404 Client Error: Not Found for url: https://idp.wikimedia.org//.well-known/openid-configuration` [10:33:53] (all dns nodes updated) [10:36:47] hmm, I can't find any doc on that [10:39:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:27] slyngs: hello, any idea on what's missing? :) [10:39:52] I think Simon is on holidays [10:40:03] perfect timing :D [10:40:09] ah great, who is his backup? Moritz? :) [10:40:29] where do you get the error? I tried to access netbox.wikimedia.org but I get only a hanging tab [10:40:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:49] elukey: on https://netbox.wikimedia.org [10:40:55] weird [10:41:11] redirects me to https://netbox.wikimedia.org/oauth/login/oidc/?next=/ [10:41:21] ok if I don't specify https it hangs [10:41:26] does the same to me [10:41:30] okok let's see [10:42:04] I think *somewhere* we need to allow netbox1003 and netbox2003 to access that endpoint [10:42:14] doesn't seem to be in Puppet [10:42:35] lemme check the idm nodes [10:42:54] envoy runs on idm1001:443 [10:43:02] following the path :) [10:44:25] ok so we have envoy for TLS -> httpd -> bitu [10:45:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:01] elukey: maybe if I add myself to ADMINS_LIMITED in /etc/bitu/settings.py and reload bitu? [10:48:09] I'll see more options on the UI? [10:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:28] XioNoX: I am checking the puppet code and I see profile::netbox::oidc_secret [10:49:35] Packaging, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2024/2025-Q1): upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#10001892 (Volans) Upgraded the package all clean now on db1179. [10:49:59] do we need to create another one or similar?
[10:50:25] that would be in the puppet private repo [10:50:33] I was checking https://phabricator.wikimedia.org/rOPUP644d8cd5c41f3029239fef517af0c84ca8eaf14b [10:50:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:31] lemme see the puppet roles [10:51:57] elukey: there is already SOCIAL_AUTH_OIDC_SECRET on netbox1003:/etc/netbox/configuration.py [10:52:51] yep yep [10:53:55] because we also have redis on the idm nodes [10:54:08] I am wondering if changing the netbox hosts mean also to purge some old data [10:54:11] or similar [10:55:24] I think it's just an ACL somewhere on idm.wikimedia.org [10:55:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:16] the auth configuration is similar between netbox-dev2003 and netbox1003 (except url, key and password as expected) [10:56:54] so it is completely different :D [10:57:01] hahaha [10:57:15] I mean the same keys are there, so I don't think we're missing Puppet config [10:58:15] elukey: I'm going to try my idea on idm-test just in case [10:59:01] sure sure, I am checking httpd logs on idm2001 in the meantime [10:59:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:04] don't want to distract you but how was netbox 4 working during the tests ? no idm? 
[11:00:27] volans: idm :) but it was a different server [11:00:39] so different hostname and acl [11:00:57] also there was an error that I don't recall if it was similar, and Simon fixed it behind the scenes on idmXXXX [11:01:00] no idea how [11:01:23] check bash history [11:01:25] (my try didn't result in anything :) ) [11:01:26] his user or root [11:01:38] if it was UI you're screwed :D [11:05:21] where are the netbox logs? [11:05:47] /srv/log okok [11:07:27] yes [11:07:28] sorry [11:10:15] something different from netbox-dev and netbox [11:10:19] SOCIAL_AUTH_OIDC_USERINFO_URL = 'https://idp-test.wikimedia.org/oidc/profile' [11:10:34] SOCIAL_AUTH_OIDC_USERINFO_URL = 'https://idp.wikimedia.org//profile' [11:10:40] the latter looks wrong [11:10:53] ohh [11:11:24] let me fix it manually jsut to see [11:13:23] same for the SOCIAL_AUTH_OIDC_ENDPOINT key [11:13:36] progress, now I get application not authorized to use CAS [11:14:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:21] https://apereo.github.io/cas/6.6.x/installation/Troubleshooting-Guide.html#application-not-authorized [11:19:14] I am not sure why we get that config on netbox1003, the code looks fine [11:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:15] ok I see now [11:21:27] in netbox standalone hiera config we have [11:21:38] profile::netbox::cas_server_url: "%{lookup('apereo_cas.staging.oidc_endpoint')}" [11:21:44] meanwhile for netbox "prod" [11:21:59] profile::netbox::cas_server_url: "%{lookup('apereo_cas.production.base_url')}" [11:22:12] yeah [11:22:42] but is it ok like this? [11:22:51] because it explains why we have those weird urls [11:23:04] yeah it's the old auth scheme [11:23:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055893 [11:23:50] +1ed [11:24:02] thx, waiting for PCC [11:25:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:46] elukey: still not out of the woods :( [11:28:25] okok one step at the time :) [11:28:43] yup :) [11:29:01] I don't manage to get anything loading for netbox.w.o though [11:29:03] you? [11:29:10] ah ok now yes [11:29:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:25] ok this one seems more clear [11:30:28] yeah. that's the ACL stuff I was talking about [11:31:09] do you know any pointer to see where are the acls configured? [11:31:16] nop :( [11:31:27] is the token coming from hieradata/role/common/idp.yaml in the private repo? 
[11:31:34] other than https://apereo.github.io/cas/6.6.x/installation/Troubleshooting-Guide.html#application-not-authorized [11:33:47] eh https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037506 [11:34:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:35:12] elukey: ohhh https://gerrit.wikimedia.org/r/c/operations/puppet/+/924895 [11:36:02] are we also switching from cas to oidc in this upgrade? Or am I misunderstanding? [11:36:21] ah nice! [11:36:41] XioNoX: isn't that what I mentione dabove? seems fine [11:36:57] volans: public repo, but yeah [11:37:22] but let's backtrack a second [11:37:23] hmm, netbox_oidc is already there [11:37:33] what I was trying to say :D [11:37:48] XioNoX: yeah but profile::netbox::oidc_service: 'netbox' [11:37:55] it is not _oidc [11:38:39] also it doesn't make a lot of sense to me that netbox[1,2]003 are trying to use CAS, and erorring out for it [11:38:51] if we are setting up oidc, that IIUC is bitu (so different) [11:39:02] or am I misunderstanding? [11:39:15] hence my question about "are we migrating away from cas to oidc as well?" [11:39:48] elukey: yeah we're setting up oidc, link on -next [11:39:49] like* [11:41:47] alright, it works! [11:42:04] with netbox_oidc? [11:42:10] nice catch [11:42:10] yeah, manually tested [11:42:16] the username is the uuid though [11:42:37] ahahha yes [11:42:46] and there are 0 permissions [11:43:12] no sync from ldap? 
[11:43:24] it should :) [11:44:10] just the mapping of UID and such [11:44:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:07] we had all that in ldap.py in /etc/netbox in the past [11:45:12] not sure if still used with the new setup [11:45:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:55] * volans has to go for lunch, ping if needed [11:46:31] XioNoX: maybe we should stop for say ~30 mins to eat something and then restart? [11:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:50:15] elukey: sure [11:51:19] ack! 
bbiab [11:54:21] netops, Infrastructure-Foundations, SRE: Set Leaf switches in Codfw rows C & D to active and make new vlans live - https://phabricator.wikimedia.org/T370629 (cmooney) NEW p:Triage→Medium [11:59:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:15] XioNoX: back, any luck? [12:04:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:08] if it worked for netbox-next it should now as well [12:05:25] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055901 [12:06:06] comparing it with the other OIDC services, that config knob is missing [12:07:00] XioNoX: but how did it work for netbox-next then?
[12:07:25] elukey: netbox-next is on hieradata/role/common/idp_test.yaml [12:08:01] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/idp_test.yaml#L38 [12:08:39] +1ed okok [12:08:41] let's try [12:09:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:47] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630 (cmooney) NEW p:Triage→Medium [12:14:06] deployed, but doesn't solve it, maybe some caching? [12:15:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:58] did you run puppet on both idp nodes? Not sure if they need a reload [12:24:52] elukey: yeah, tomcat reloaded as well [12:25:54] ack okok [12:26:53] XioNoX: mmm maybe we need to restart netbox as well?
[12:27:12] good idea, trying [12:27:32] no luck [12:30:46] we do have redis caching, that could play a role [12:30:56] not sure what it caches though [12:30:58] * volans back [12:33:39] dunno what changed, but now I have access [12:33:43] user permissions are back [12:33:55] my username is still a UUID though [12:34:32] it means you got the groups [12:34:53] but there is still something wrong with the mapping [12:35:01] see my user: https://netbox.wikimedia.org/users/users/326/ [12:35:08] got wmf and nda [12:35:18] yeah [12:35:24] not sure why I see admin panel though [12:35:29] should be ops only [12:35:37] sorry I got ops, misread [12:35:41] ops and wmf, my bad [12:36:14] so it's creating new users, not matching the existing ones (so missing any saved preferences) and not setting name and email [12:39:17] ldap.py seems the same across the netbox nodes [12:39:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:18] volans, elukey, alright [12:40:23] I deleted my own UUID user [12:40:31] and now I'm in with my regular one [12:40:50] I'm going to delete the other 3 faulty ones [12:41:07] trying [12:41:18] wow, weird [12:41:32] so does it match first by ID and then by UID? [12:41:42] how do you delete uuid? [12:41:55] from https://netbox.wikimedia.org/users/users/ [12:42:13] netops, Infrastructure-Foundations, SRE: Add data to automation for new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#10002128 (cmooney) [12:42:13] I can confirm I'm back with my old user [12:42:16] same token, etc... [12:42:19] history [12:42:34] yep same [12:42:55] given the time we might want to notify the dc-ops chan that work is still in progress?
[12:44:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:04] volans: yeah, in theory we're almost good [12:45:25] *just* testing all scripts and reports and cookbooks... :D [12:45:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:48] and deploying homer :) [12:45:55] and testing it :D [12:46:03] so yeah maybe it will take another while [12:46:05] first figuring out why https://netbox.wikimedia.org/core/jobs/ is stuck :) [12:48:40] I've restarted some failed units: [12:48:41] - wmf_auto_restart_uwsgi-netbox-scriptproxy.service [12:48:46] - wmf_auto_restart_rq-netbox.service [12:48:51] -netbox_housekeeping.service [12:48:58] the last one did [12:48:58] [*] Checking for expired jobs Deleting 44 expired records... Done. [12:49:02] let's see if it helps [12:49:19] or maybe are just the very old ones [12:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:05] XioNoX: do you still have patches to merge for netbox-extras? 
[12:52:23] volans: no everything should have been rebased [12:52:30] ok then sending a fix [12:52:33] and deployed on 1003/2003 [12:53:02] volans: for ganeti-netbox-sync.py ? [12:53:12] roles -> device_roles, yes [12:53:31] the other way around? [12:53:59] ah, no, I see [12:54:03] API URL [12:54:38] https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1055924 [12:55:58] volans: please deploy it manually and not with the cookbook otherwise it might push it to the 1/2002 hosts [12:56:32] becaus eof the cumin alias ack [13:03:50] the jobs being stuck is most likely related to redis, but no idea how [13:04:04] have you checked ferm? [13:04:10] yeah, it's all open [13:04:16] and users on redis? [13:04:25] tcpdump show traffic [13:05:05] port/pw/db/etc seem correct [13:05:10] and didn't change since before [13:07:46] https://www.irccloud.com/pastebin/KOqcpaKc/ [13:09:11] ack [13:11:10] of course this worked out of the box on netbox-next :) [13:11:45] you already tried to bounce rq-netbox I guess [13:13:35] python manage.py rqstats [13:13:44] shows 69 failed in the default queue [13:13:44] volans: I'm stupid :) [13:13:47] lol [13:13:48] volans: it was stopped [13:13:53] ahaahhahaah [13:14:20] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:38] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:47] adding the scripts and reports [13:15:56] for the reports we have to change runreport to runscript [13:16:05] in modules/profile/manifests/netbox.pp [13:17:13] XioNoX: ^^^ 
[13:17:19] not sure if it's part of your patches already [13:17:46] nop, missed that [13:17:55] failed 82, finished 8 [13:18:01] something is moving on the queue side :D [13:18:15] the ganeti ones need access to the ganeti endpoint [13:18:36] for example: ConnectTimeout: HTTPSConnectionPool(host='ganeti-test01.svc.codfw.wmnet' [13:18:43] smells like ferm [13:18:55] GetDeviceStats seems to be hammering it [13:19:20] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:31] yeah, sending patches [13:20:20] basically git grep netbox1002 and 2002 [13:20:25] to be replaced with 3 [13:20:38] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:46] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10002220 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=67fa4e46-b51e-42b8-9853-92735f7f0f85) set by ayounsi@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their... [13:21:07] yep, downtiming 1/2002 now that 1/2003 is doing better [13:21:09] for the cumin alias XioNoX I think we should hardcode the *3 ones for now [13:21:13] ack [13:21:16] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10002221 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b31b9f0d-62c5-41f6-9791-fca68557c987) set by ayounsi@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their...
[13:21:21] $ sudo cumin A:netbox [13:21:21] 4 hosts will be targeted: [13:22:49] anything that I can help with? [13:25:38] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:39] * volans has a meeting in few minutes [13:29:00] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055932 [13:29:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:38] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:57] LGTM! 
[13:34:01] thx [13:39:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:30] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635 (cmooney) NEW p:Triage→Medium [13:40:38] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:35] elukey: another one to fix my previous precipitation https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055940 [13:45:38] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:56] elukey: $netbox_extras_path is also used for the systemd timers [13:47:33] elukey: note that puppet is disabled on netbox1002 in case we need to rollback [13:49:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:05]
* volans back until next meeting [13:51:07] XioNoX: didn't we have something that create /srv/netbox/customscripts? How did we do it for netbox-next? [13:52:00] elukey: yeah we did, just cleaned it up in the previous CR : https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055932/2/modules/profile/manifests/netbox.pp [13:52:16] ahhh right and I didn't notice, sigh [13:52:29] we had `if $deploy_project == 'netbox-dev'` which we don't need anymore [13:55:08] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:55] FIRING: [22x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:38] FIRING: [22x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:34] `sudo tail -f /srv/log/netbox/main.log` is being flooded logs [14:09:20] FIRING: [17x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:20] FIRING: [16x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:38] FIRING: [16x] 
SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:20] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:35] from itself it seems (looking at apache logs) [14:49:16] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10002716 (joanna_borun) p:Triage→High a:cmooney [14:54:37] XioNoX: shout if you need help [14:54:42] volans: so yeah I don't get what's going on with the scripts, they run fine, but [14:55:06] volans: first, stuff that runs using the API runs as "gehel"'s username :) [14:55:13] see https://netbox.wikimedia.org/core/jobs/ [14:55:36] wut? [14:55:57] then even if they complete, they show up in https://netbox.wikimedia.org/extras/scripts/ as never ran [14:56:08] like the capirca one, or the accounting [14:57:17] rqstats agrees [14:57:20] on the failures [14:59:52] oh accounting completed for the first time, what changed? [15:02:20] I asked :D [15:03:15] volans: like for example https://netbox.wikimedia.org/extras/scripts/#module16 "PhysicalHosts" [15:03:48] https://www.irccloud.com/pastebin/EZAt2xGp/ [15:04:04] so it runs fine, but doesn't save the output [15:05:37] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055957 not going to fix it, but still better [15:07:08] volans: I get "Reverse for 'managedfile' not found. 'managedfile' is not a valid view function or pattern name." when trying to remove the 2nd datasource.
[15:07:30] volans: unless you're on something else, can I try to delete all the scripts/reports and import them all back? [15:08:50] XioNoX: go for it, I'll be there to look in 5~10 sorry [15:09:06] suggestion [15:09:37] find . -name *.pyc (then with -delete) and same for __pycache__ [15:11:03] I am here if needed XioNoX [15:11:20] <3 : [15:11:21] :) [15:34:35] volans: "LookupError: App 'extras' doesn't have a 'JobResult' model." [15:34:47] I think that's the root of the issue [15:37:06] for later: https://phabricator.wikimedia.org/P66886 [15:37:20] ack [15:37:49] so rollback is just pointing back discovery to the old hosts, re-enable puppet and disable maintenance mode? or some more changes in puppet are needed? [15:38:10] (cumin alias comes to mind, but maybe some changes did replace old with new and needs revert) [15:38:13] yeah, and a couple puppet rollback, but puppet is disabled on netbox1002 so that can happen later on [15:43:34] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055968 [15:44:17] +1ed [15:44:20] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:24] where's the runbook of the migration? [15:44:28] if I have to follow along :) [15:44:36] volans: for the rollback? 
[15:44:54] no for rollforward [15:44:58] to step back on it [15:45:03] and check CRs [15:45:07] if they need revert or not [15:45:10] https://etherpad.wikimedia.org/p/netbox4-upgrade [15:45:29] not all CRs are on it though [15:45:38] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:36] great :D [15:46:37] it's a start [15:47:20] XioNoX: I can run puppet on dns nodes if you want [15:47:25] elukey: cool, thx [15:48:16] there is also the netbox-extra repo that has been rebased with the dev branch fixe [15:48:45] and deployed? [15:48:49] if not deployed not urgent [15:49:01] not deployed on the old hsts [15:49:27] last commit c1896766f195a55279712570f64910ea2a1abdf7 Thu Jul 4 09:37:09 [15:49:30] ok not urgent [15:51:49] elukey: merged in puppet [15:54:08] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055970 [15:54:36] the last real thing is the netbox-extra path on the frontends [15:54:49] done [15:56:37] XioNoX: started [16:05:12] elukey: netbox.wo points to the old UI now [16:05:51] still v4 for me [16:06:02] still in progress [16:08:13] Of course now I think I know what went wrong :) [16:08:37] I should have deleted all the previous Jobs [16:08:41] rotfl, tell us [16:08:46] it was trying to re-run something [16:08:47] all the jobs from < 4 [16:08:48] in a loop? [16:09:01] there must be just a few, didn't they fail and that's it? 
[16:09:05] yeah, going over the old results, and the pointers were not valid anymore [16:09:37] no idea, but it was also something I did on -dev but the error message was more explicit [16:09:57] like it couldn't open the page for jobs or something like that [16:10:17] try to delete them on the new UI and if all fixes there is still the choice to roll forward [16:10:27] but I'll not be around for long [16:13:33] I'll try it, but indeed it's getting late for a roll forward [16:14:20] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:38] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:25] so first, netbox.wikimedia.org should be fully back on v3 [16:16:33] super [16:16:35] checking [16:16:39] so I can unblock everyone [16:16:52] cumin completed on dns nodes [16:16:56] and I can confirm netbox3 [16:17:00] I'll keep puppet disabled there [16:17:01] puppet still disabled [16:17:03] on 1002 [16:17:22] yeah, I'll keep it disabled until tomorrow if we can do the actual upgrade tomorrow [16:17:28] ack [16:17:36] if more investigation is needed I'll puppetize the proper changes [16:17:37] and the timers are all running [16:17:39] ok [16:31:30] XioNoX: https://netbox.wikimedia.org/extras/scripts/results/6051554/ [16:31:35] doens't sem to run [16:32:32] shoudl we cleanuop queues? is there a manage.py command for that? 
[16:32:33] volans: good catch, I stopped rq-netbox there [16:32:38] ah ok [16:32:39] easier :D [16:33:00] checking if it runs after you restart it [16:33:13] it should be back [16:34:27] mmmh still pending [16:34:36] or it's going through the queue of past jobs [16:34:38] not sure [16:35:22] rqstats shows just 2 [16:37:00] maybe the job needs to be restarted after rq-netbox is started? [16:38:03] try to run a new one [16:38:11] id 6051606 [16:39:20] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:20] no luck so far [16:41:56] Sorry, I have to run as soon as the meeting ends. [16:42:12] feel free to delete pending/started jobs [16:43:38] volans: https://netbox.wikimedia.org/extras/scripts/results/6051630/ [16:43:42] looks like it completed? [16:43:57] https://netbox.wikimedia.org/extras/scripts/results/6051606/ [16:43:59] last one [16:44:20] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:44] volans: that one is older, no?
[16:45:15] last one I started, 7m ago [16:45:17] I think [16:45:25] try a new one [16:45:28] if it works [16:45:44] gotta go, sorry [16:46:54] volans: yep, all good https://netbox.wikimedia.org/extras/scripts/results/6051645/ [16:46:57] it runs instant [16:54:20] FIRING: [11x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:38] FIRING: [11x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:20] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:19] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10003767 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6e8b1723-decb-4086-9785-376414b41d2c) set by ayounsi@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their s... 
[17:09:20] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:38] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10003769 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b306ac42-cfcf-4095-a53e-80b1fd183949) set by ayounsi@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their s... [17:53:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10004023 (10cmooney) [17:53:11] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10004024 (10cmooney) [21:26:06] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10005058 (10cmooney) OK I have allocated 10.195.1.0/25 in Netbox and configured 10.195.1.1 as a secondary IP on pfw3-codfw on the mgmt... [22:39:17] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10005223 (10bking) `elastic110[0-2]` are banned and ready , as is `wdqs1016`.