[00:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:30:41] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:48] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:34:23] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:07] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052#10047371 (ayounsi) Data is back on https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown Only the cleanup...
[07:03:57] CAS-SSO, Beta-Cluster-Infrastructure, Bitu, cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047378 (SLyngshede-WMF) ` 2024-08-06 20:10:04,149 WARN [org.apereo.cas.util.function.FunctionUtils] -
CAS-SSO, Beta-Cluster-Infrastructure, Bitu, cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047381 (SLyngshede-WMF) CAS uses the following to lookup the user: ` cas.authn.ldap[0].basedn=dc=wikimedia,dc=org cas.a...
[07:15:50] CAS-SSO, Beta-Cluster-Infrastructure, Bitu, cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10047401 (SLyngshede-WMF) While I don't have the password, I've tested authenticating as jenkin-deploy on idp-test2004, an...
[07:16:36] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with netbox-more-metrics - https://phabricator.wikimedia.org/T311052#10047402 (ayounsi)
[07:26:39] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: get rid of WMF Production Patches - https://phabricator.wikimedia.org/T310717#10047431 (ayounsi) Open→Resolved a:ayounsi All is done !
[08:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:05:47] elukey: hello, can you review/deploy my patch to homer git::clone which explicitly sets the `umask` https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056985 :)
[08:08:22] hashar: o/ I'll do it this morning, sorry for the lag but I was a bit busy with weird bugs in the past days :)
[08:09:19] that is what I thought :-] hence the gentle reminder hehe
[08:17:38] argh I have sent the wrong series
[08:21:10] I have rebased my local series which simply rebased the change above
[08:21:15] so it is still good to go
[08:34:23] FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:26] SRE-tools, Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10047547 (elukey) Rolled out the change to the hadoop cluster, this is the only error that I got: ` [2024-08-07T08:38:59] Unable to update host 'an-worker110...
[08:45:17] CAS-SSO, Beta-Cluster-Infrastructure, Bitu, cloud-services-team, and 2 others: Update basedn in CAS - https://phabricator.wikimedia.org/T371930#10047550 (SLyngshede-WMF) a:SLyngshede-WMF
[08:51:11] CAS-SSO, Beta-Cluster-Infrastructure, Bitu, cloud-services-team, and 2 others: Update basedn in CAS - https://phabricator.wikimedia.org/T371930#10047573 (SLyngshede-WMF) p:Triage→Medium We've tested modifying the basedn on test and @hashar confirms that login is now working.
[08:55:14] hey folks!
[08:55:26] I am rolling out debmonitor-client 0.4.0 fleetwide
[08:55:59] after this we should finally have only "Debian XX" in debmonitor, rather than "Debian" only
[09:23:58] netbox, Infrastructure-Foundations: Netbox rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10047614 (elukey)
[09:38:49] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10047634 (ayounsi) Notes from the Debrief meeting What went wrong ? * Too optimistic :) * Huge quantity of breaking changes, some undocumented (inc. required an upgrade to boo...
[09:39:05] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10047640 (ayounsi) Open→Resolved a:ayounsi
[09:41:24] notes from the Netbox 4 debrief - https://phabricator.wikimedia.org/T336275#10047634
[09:43:09] netbox, Infrastructure-Foundations: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957 (ayounsi) NEW p:Triage→Low
[09:43:15] and netbox 3 decom task https://phabricator.wikimedia.org/T371957
[09:50:34] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060403 can I get a quick review on this?
[09:53:08] done :)
[09:53:13] thx!
[10:07:04] netbox, Infrastructure-Foundations, Patch-For-Review: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957#10047715 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `netbox2002.codfw.wmnet` - netbox2002.codfw.wmnet (**PASS**) - Downtimed h...
[10:25:46] netbox, Infrastructure-Foundations, Patch-For-Review: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957#10047733 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `netbox1002.eqiad.wmnet` - netbox1002.eqiad.wmnet (**PASS**) - Downtimed h...
[10:38:09] netbox, Infrastructure-Foundations, Patch-For-Review: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957#10047806 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `netboxdb1002.eqiad.wmnet` - netboxdb1002.eqiad.wmnet (**PASS**) - Downtim...
[10:50:06] netbox, Infrastructure-Foundations, Patch-For-Review: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957#10047849 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `netboxdb2002.codfw.wmnet` - netboxdb2002.codfw.wmnet (**PASS**) - Downtim...
[12:01:26] the puppet compiler shows me obsolete nodes which got deleted recently, and that causes a compilation to fail
[12:02:14] https://puppet-compiler.wmflabs.org/output/927986/1626/ shows deploy1002.eqiad.wmnet which was removed on August 1st and deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud which was also dropped on August 1st
[12:02:50] for the latter, I have confirmed on the deployment-prep puppet server that the facts are no longer uploaded for deployment-deploy03
[12:03:41] so I guess both hosts have to be purged but according to https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Purging_nodes , "node is considered active if it has submitted a report to puppetdb in the last 14 days"
[12:03:55] and thus I guess both will be ghosts until they expire
[12:17:27] I think that can be ignored
[12:17:45] the patch I have made to remove umask from git::clone is ready : https://gerrit.wikimedia.org/r/c/operations/puppet/+/927986/
[12:18:00] and I am not sure whom to add since it touched various bits of the puppet code
[12:28:56] elukey: https://github.com/netbox-community/pynetbox/pull/632#pullrequestreview-2224665097 :)
[12:30:00] you're now an official contributor :)
[12:31:59] nice!
[12:34:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:57] elukey: unrelated but I also opened this https://github.com/TheDJVG/netbox-more-metrics/issues/34 (cc slyngs, godog)
[12:59:18] hashar: I merged the umask change, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056981 still does not look ok to me - it is not a matter of whether a repo is public or not; we can allow others to read, but we shouldn't allow write permissions on read-only repos (containing code)
[12:59:47] it is (in my opinion) a basic protection to avoid unwanted side effects
[13:05:15] elukey: yeah I have sent another patch for that
[13:05:44] and my guess is we should change git::clone to default to 644 (ie without writable bit)
[13:05:53] but that is another long series of changes :-]
[13:06:23] thanks for the merge! I am happy to see `umask` is gone!
[13:07:46] +1 for 644 yes!
[13:07:53] thank you for the refactoring :)
[13:20:41] SRE-tools, Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10048179 (elukey) Buster and Bookworm rollouts done, no big issues registered. The only drawback is that due to the high volume of writes to the db (since we...
[13:27:46] hashar: did something change with CI recently? it's now reporting tons of errors that were not being reported before: https://integration.wikimedia.org/ci/job/tox/1833/console
[13:28:10] XioNoX: cheers, I've subscribed
[13:39:23] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590#10048268 (ayounsi) a:ayounsi Sent the patches for the last few ones left in the task description. There are also https://gerrit.wikimedia.org/r/c/operations/softw...
[13:44:27] Puppet, Release-Engineering-Team, Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277#10048286 (hashar) Open→Resolved The series of patches has led to the removal of `umask` from `git::clone` In roughly the order the patc...
[13:44:54] XioNoX: yeah I have upgraded the base python image that is running tox
[13:45:26] which added support for python 3.10 to 3.12
[13:46:23] as for why prospector fails solely under 3.12 .. I have no clue :/
[13:46:33] hashar: hmmm, is it possible to manually use the older version ? Looks like the package version change is causing issues :(
[13:49:54] looks like pylint crashed under Python 3.12
[13:55:52] hashar: any pointers on what I should do?
[14:00:24] XioNoX: yes remove py312 from the list of environments in tox.ini
[14:01:02] ok!
[14:01:04] envlist=py{39,311,312}-{flake8,bandit,mypy,prospector}
[14:01:05] ^^^
[14:01:49] looks like `customscripts/import_server_facts.py` uses a syntax that ends up crashing astroid
[14:04:12] yeah
[14:04:13] https://github.com/pylint-dev/astroid/issues/2201
[14:04:20] or https://github.com/pylint-dev/pylint/issues/8782
[14:04:32] pylint needs to be upgraded in order to support python 3.12
[14:04:34] XioNoX: ^
[14:05:00] I reproduce it with `tox -e py312-prospector`
[14:05:46] hashar: nice find! can you update it in the image or is it more complicated?
[14:05:56] pylint?
[14:06:10] yeah
[14:06:14] the dependency is defined in netbox-extras in the tox.ini file
[14:06:48] well actually pylint is a transitive dependency of prospector
[14:10:06] the short story is prospector's last release was cut more or less when python 3.12 got released
[14:10:19] and it does not support 3.12 , there are some pull requests at https://github.com/landscapeio/prospector/pulls
[14:14:50] wow, ok
[14:15:12] thx for looking into it!
[14:15:33] but I agree it is a bit surprising, I should drop an announcement
[14:22:05] Puppet, Infrastructure-Foundations, Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980 (hashar) NEW
[14:22:14] elukey: ^ :)
[14:22:30] I'll do it eventually one day
[14:22:38] <3
[14:23:04] after that I remove all uses of git::clone in favor of deploying with scap
[14:23:08] * hashar vanishes
[16:34:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:36:11] ^ reason: ConnectionError: HTTPSConnectionPool(host='storage.googleapis.com
[16:37:36] !log puppetserver1002 systemctl start dump_ip_reputation
[16:37:36] mutante: Not expecting to hear !log here
[16:39:23] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:49] netops, Infrastructure-Foundations, SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049699 (Dzahn) We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 today. Shortly after it resolved. But somethi...
[20:54:57] netops, Infrastructure-Foundations, SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049702 (Dzahn) {F57154133}
[21:13:55] netbox, Infrastructure-Foundations, Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10049729 (Dzahn) Same just happened on gerrit1004
[21:27:15] netbox, Infrastructure-Foundations, Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10049757 (Dzahn) Resolved→Open Same for host vrts1003. It seems everything works except the cookbook can't set the status in netbox. I manually changed th...