[02:48:29] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:29] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:42] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:42] interesting find: a while ago we changed the authorized keys file for cumin to use 'restrict' instead of listing a bunch of no-$NAME options. But 'restrict' also includes "disabling PTY allocation", so that means no scp nor interactive SSH. Is there any concern with re-enabling it by changing the line to 'restrict,pty'?
[07:36:11] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[07:53:53] volans: Not completely sure of the implications, but out of curiosity why do we need pty allocation?
[07:55:50] it can be useful to debug things at times, transfer small files between hosts
[07:56:03] using SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh (or scp)
[08:21:52] I wonder if the PTY is still required after scp switched to sftp internally
[08:22:55] I noticed because I was trying it and it hung :D
[08:23:06] sorry, ssh hangs, scp fails
[08:23:22] Then it's required :-)
[08:26:26] sftp works: SSH_AUTH_SOCK=/run/keyholder/proxy.sock sftp
[08:26:35] unless it's something else, I didn't try adding it to see if it unblocks
[08:28:25] Okay, so the Debian scp command isn't new enough to use sftp internally
[08:30:22] :D
[08:30:29] it's a feature
[08:31:38] https://www.openssh.com/txt/release-9.0
[08:32:15] There's a way to have the new scp use the old protocol, but there doesn't appear to be a way to force sftp on older clients.
[08:58:29] (SystemdUnitFailed) firing: (4) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:07] can someone silence the idm-test alerts? :)
[09:01:29] XioNoX: I can do better, I fixed them
[09:01:53] thx :)
[09:01:55] Not really sure why that last one showed up, it alerted as I rolled out the fix
[09:02:15] AM bug
[09:02:16] Anyway, it's working now.
They annoy me too
[09:03:29] (SystemdUnitFailed) firing: (4) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:19] Oooookay
[09:04:43] https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed&q=%40receiver%3Dinfrastructure-foundations-irc
[09:06:04] Yeah, I was looking at that one as well, it's not there
[09:07:16] there is one for eqiad
[09:07:25] summary: wmf_auto_restart_apache2-htcacheclean.service Failed on idm-test1001:9100
[09:07:36] not the same one though
[09:07:57] yeah I think it's a bug in AM?
[09:08:00] godog: ^
[09:09:02] Yes, that one fails across a number of hosts
[09:10:41] FYI I'll be playing with netbox-next so if anything happens there you know whose fault it is ;)
[09:12:48] XioNoX: will take a look shortly
[09:13:15] thx!
[09:13:29] (SystemdUnitFailed) resolved: (2) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:19] godog: The service is called: wmf_auto_restart_apache-htcacheclean but it says wmf_auto_restart_apache2-htcacheclean in alertmanager
[09:26:53] Okay, so it was rolled out with the wrong name, and the incorrect service wasn't removed
[09:28:03] slyngs: cheers, ok so things were "wrong" in puppet due to left over state but otherwise working as intended (?)
[09:28:35] If you do a git log on modules/profile/manifests/idm.pp 11:28:29
[09:28:45] You can see that it's being renamed
[09:29:07] I'm just testing out on my test server if prometheus will pick up if it's being disabled
[09:29:40] ah ok got it!
thank you for the context
[09:30:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Just noticed what has been probably in the radar for @cmooney for some time now: [[https://ne...
[09:30:23] I doubt it FWIW, most of the time removing resources from puppet means they are left as is on the filesystem
[09:30:37] I'm currently trying a systemctl disable and reset-failed on idm-test1001, let's see if that clears the alert in alertmanager
[09:31:25] pretty sure it will yeah, for the auto-restarts there are also timers to remove/disable FWIW
[09:32:18] Oh, right
[09:33:06] That fixed the issue. I'll just clean up the remaining three hosts
[09:33:13] https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed&q=name%3Dwmf_auto_restart_apache2-htcacheclean.service
[09:34:05] sweet, thank you
[09:35:45] slyngs: re wmf_auto_restart_apache2-htcacheclean.service from what i can tell this is an old unit left over from a badly named define. the real unit should be wmf_auto_restart_apache-htcacheclean.service. i cleaned up the old files manually on idm[12]
[09:36:10] exactly
[09:36:40] It got rolled out to arclamp1001 and arclamp2001 as well.
I'll just remove them there as well
[09:36:42] however it's unclear to me if you actually need apache-htcacheclean. afaik it's disabled by default and i don't see anything managing that service, so if you don't need it i'd be tempted to also remove wmf_auto_restart_apache-htcacheclean
[09:36:50] ack
[09:37:20] fyi to remove i think you can systemctl stop && systemctl disable && systemctl reset-failed
[09:37:35] Yep, that's what I've done :-)
[09:37:58] cool :)
[09:38:36] I think maybe certain deb packages with Apache modules will enable the htcacheclean service, and the auto_restart was added to ensure that the service would restart on package updates
[09:40:18] slyngs: i think my point is either manage both the service apache-htcacheclean and the auto restart or neither
[09:40:55] right now you manage the auto restart but not the service, which means you can get the machine in a strange state e.g. on reboot
[09:40:58] loaded (/lib/systemd/system/apache-htcacheclean.service; disabled; vendor preset: enabled)
[09:41:05] from idm-test, notice disabled
[09:41:48] Okay, yes that makes sense, to keep them in sync
[09:43:43] slyngs: it may also make sense to add this as a flag to the http class e.g. enable_htcacheclean instead of directly in the idm class
[09:44:47] That would be better, it's used in a few other places as well
[09:47:53] I'll just try my hand at a patch and send it your way
[09:48:00] yes i did a quick check and it seems it's running on 13/485 servers with the httpd class, and i only see the service being managed in one place (modules/profile/manifests/opensearch/api/httpd_proxy.pp). however that doesn't have the auto-restart, so i think a couple of other things will definitely benefit
[09:48:06] sure thing thanks
[11:08:48] so...
we have a small~ish problem
[11:09:17] the latest wheels for netbox generate a 127M artifacts/artifacts.bullseye.tar.gz
[11:09:21] gerrit has:
[11:09:24] [receive] maxObjectSizeLimit = 100m
[11:10:14] was 80M last time, I can check why it's so big, but anyway we're close to the limit and need to re-think i
[11:10:17] *it
[11:13:26] ok weird, this might be a bug, we have duplicated wheels in the last generation, looking into it
[11:13:32] that said 80 is close to 100 :D
[11:16:06] CAS-SSO, netbox, Infrastructure-Foundations: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 (SLyngshede-WMF)
[11:18:31] CAS-SSO, netbox, Infrastructure-Foundations: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 (SLyngshede-WMF) I currently have a patch with the python-social-auth project to enable OIDC via CAS in the django-social-auth plugin. It still needs documentation...
[11:49:00] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) p:Triage→Low
[11:49:44] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[11:49:52] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[12:41:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) a:ayounsi
[12:46:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) Removed from Netbox, last step is the above Puppet change ready for reviews.
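Editor's note on the 11:09 artifact-size discussion: a minimal sketch, in Python, of a pre-push check that compares a tarball's size with Gerrit's `receive.maxObjectSizeLimit` (100m in the log) and lists distributions that have more than one wheel in the archive. The function name `check_artifact`, reading `100m` as 100 MiB, and treating "duplicated wheels" as "several wheel files for the same distribution" are all assumptions made for illustration; this is not the team's actual tooling.

```python
import os
import tarfile
from collections import defaultdict

# Assumption: Gerrit's "100m" is read as 100 MiB.
GERRIT_LIMIT_BYTES = 100 * 1024 * 1024


def check_artifact(path, limit=GERRIT_LIMIT_BYTES):
    """Return (size_bytes, fits_under_limit, duplicates) for a .tar.gz of wheels.

    duplicates maps a distribution name to its wheel file names when the
    archive holds more than one wheel for that distribution.
    """
    size = os.path.getsize(path)
    by_dist = defaultdict(list)
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".whl"):
                filename = os.path.basename(member.name)
                # Wheel file names start with "<distribution>-<version>-".
                by_dist[filename.split("-")[0].lower()].append(filename)
    duplicates = {d: sorted(f) for d, f in by_dist.items() if len(f) > 1}
    return size, size <= limit, duplicates
```

Called as `check_artifact("artifacts/artifacts.bullseye.tar.gz")`, it would flag both problems raised above in one pass: an archive over the limit and any distribution that was packaged more than once.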
[12:48:10] SRE-tools, netops, Infrastructure-Foundations, SRE, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi)
[12:49:01] SRE-tools, netops, Infrastructure-Foundations, SRE, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi) Open→Resolved Closing this task as the short term goals are done, medium terms have their own task.
[12:59:52] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Enabled it on pfw3-codfw, and removed the exception on fasw-c-codfw and it's working as expected: ` pfw3-codfw# run show lldp neighbors Local Interface Parent Int...
[13:00:04] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) a:ayounsi
[13:27:56] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @aborrero cloudcontrol2004-dev is in a public VLAN that is what we didn't relocate it in B1. Bu...
[13:39:32] hi folks
[13:39:39] I am not sure there is a pattern here but I thought I'd share in case there is
[13:39:54] I have been noticing one-off Puppet failures over the last two days or so, example: https://puppetboard.wikimedia.org/report/lvs5005.eqsin.wmnet/01991cda8cac29625a8ef11c89e56a444cd2ea62
[13:40:08] do we know if some network issues are at play?
[13:41:17] just sharing in case someone else noticed it as well
[13:41:26] running agent fixed the issue fwiw
[13:43:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) FYI, it's still needed to disable LLDP on switch interfaces facing the management routers.
[13:56:29] sukhe: we are in a meeting at the moment, i'll take a look after we finish. one thing to say (before looking) is that yesterday afternoon after the switch upgrade puppet was unhappy for most of the afternoon, which would result in a bunch of random failures
[13:56:46] jbond: ok thank you, not urgent for sure!
[13:56:56] and if it helps, this one is from today (~30 mins ago I think)
[13:57:08] but yeah like I said, another agent run cleared it up
[13:57:27] ack then not related to yesterday :) i'll take a look in a bit
[13:57:40] <3
[13:57:51] just posting for awareness mostly
[13:57:58] no pending issues on my end
[14:13:29] (SystemdUnitFailed) firing: httpbb_kubernetes_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:28] netops, Infrastructure-Foundations, SRE: Is Vlan 2122 cloud-support1-b-codfw required?
- https://phabricator.wikimedia.org/T327930 (ayounsi) Open→Resolved For the record, Netbox changes {F36932140}
[14:17:39] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (aborrero)
[14:23:31] Puppet, Infrastructure-Foundations, User-jbond: Proposal: create a framework to build containerized incident management protects - https://phabricator.wikimedia.org/T265153 (jbond)
[14:27:43] y/goAm
[14:42:48] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond) FYI i have created a new implementation which produces the following data {P45977} > And this change is a good opportunity (while being...
[14:42:52] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[14:43:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Upgrade doc updated: https://wikitech.wikimedia.org/w/index.php?title=Juniper_router_upgrade&diff=2064827&oldid=2016903 Receiver i...
[14:49:59] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @Papaul we're gonna reimage this one onto new vlans (will happen to all the public vlan ones i...
[14:54:35] netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:03:33] hi, I'm getting a CI failure on dns.git (master is failing too), I'm wondering if you could help debugging: https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/4533/console
[15:05:35] godog: that's my bad
[15:05:43] godog: this usually is because of removed prefixes in netbox
[15:05:48] yeah
[15:05:50] for which the related file included in master is gone
[15:05:59] netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) We had a conversation about this today. Conclusions: * we will migrate the remaining of cloudvirts to single NIC, so the se...
[15:06:11] hah, thank you that explains
[15:06:22] delete mode 100644 2.2.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[15:06:22] delete mode 100644 21.192.10.in-addr.arpa
[15:06:22] anything I can do/help with to fix it?
[15:06:38] sending a patch
[15:06:58] ack thanks
[15:09:07] and to be clear, the changes in the exported dns repo from netbox should not have been merged without asking the people involved in that change
[15:09:17] as that was merged as a spurious change with other "expected" changes
[15:09:49] https://gerrit.wikimedia.org/r/c/operations/dns/+/904198 volans
[15:10:47] +1ed
[15:10:48] thx
[15:11:47] sweet, thanks for the quick fix
[15:12:27] godog: done!
let me know if it's good now
[15:13:27] yeah I think we're good, cheers XioNoX
[15:13:29] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:14] netops, Infrastructure-Foundations, SRE: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Open→Resolved a:ayounsi
[15:14:55] yeah all good, locally I get FileNotFoundError: [Errno 2] No such file or directory: '/usr/sbin/gdnsd'
[15:15:02] hence "no unexpected errors"
[15:15:35] godog: follow the instructions for running locally without gdnsd
[15:16:02] thank you, jenkins passed and I'm happy as is
[15:16:10] tox -- -n
[15:28:10] Hey I/F team, could you please complete the feedback form for Sprint week https://docs.google.com/forms/d/e/1FAIpQLScRESJOLI_6N5REhzkgtOYRFOZIFhvfiLmJErrjQqmTCczENg/viewform?usp=sf_link
[15:31:50] "You've already responded" :-P
[15:32:24] I sent an email instead
[15:52:15] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Open→Resolved LLDP is now enabled on all the SRXs. > FYI, it's still needed to disable LLDP on switch interfaces facing the management routers. To expand on thi...
[18:29:04] SRE-tools, Infrastructure-Foundations, PyBal, SRE, serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (BCornwall)
[18:54:18] XioNoX: i know, but Leo would like to have everyone’s responses in the form as there are slightly different questions.
[18:58:39] * lmata would appreciate the response in survey <3
[19:03:22] thanks!