[02:48:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:42] <volans>	 interesting find, a while ago we changed the authorized key file for cumin to use 'restrict' instead of listing a bunch of no-$NAME. But 'restrict' also includes "disabling PTY allocation", so that means no scp nor interactive SSH. Is there any concern with re-enabling it by changing the line to 'restrict,pty'?
[07:36:11] <wikibugs>	 netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[07:53:53] <slyngs>	 volans: Not completely sure of the implications, but out of curiosity why do we need pty allocation?
[07:55:50] <volans>	 it can be useful to debug things at times, transfer small files between hosts
[07:56:03] <volans>	 using SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh (or scp)
[08:21:52] <slyngs>	 I wonder if the PTY is still required after scp switched to sftp internally
[08:22:55] <volans>	 I noticed because I was trying it and it hung :D
[08:23:06] <volans>	 sorry, ssh hangs, scp fails
[08:23:22] <slyngs>	 Then it's required :-)
[08:26:26] <slyngs>	 sftp works: SSH_AUTH_SOCK=/run/keyholder/proxy.sock sftp
[08:26:35] <volans>	 unless it's something else, I didn't try adding it to see if it unblocks
[08:28:25] <slyngs>	 Okay, so the Debian scp command isn't new enough to use sftp internally
[08:30:22] <volans>	 :D
[08:30:29] <volans>	 it's a feature
[08:31:38] <slyngs>	 https://www.openssh.com/txt/release-9.0 
[08:32:15] <slyngs>	 There's a way to have the new scp use the old protocol, but there doesn't appear to be a way to force sftp on older clients.
[08:58:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:07] <XioNoX>	 can someone silence the idm-test alerts? :)
[09:01:29] <slyngs>	 XioNoX: I can do better, I fixed them
[09:01:53] <XioNoX>	 thx :)
[09:01:55] <slyngs>	 Not really sure why that last one showed up, it alerted as I rolled out the fix
[09:02:15] <XioNoX>	 AM bug
[09:02:16] <slyngs>	 Anyway, it's working now. They annoy me too
[09:03:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:19] <slyngs>	 Oooookay
[09:04:43] <XioNoX>	 https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed&q=%40receiver%3Dinfrastructure-foundations-irc
[09:06:04] <slyngs>	 Yeah, I was looking at that one as well, it's not there
[09:07:16] <XioNoX>	 there is one for eqiad
[09:07:25] <XioNoX>	 summary: wmf_auto_restart_apache2-htcacheclean.service Failed on idm-test1001:9100
[09:07:36] <XioNoX>	 not the same one though
[09:07:57] <XioNoX>	 yeah I think it's a bug in AM?
[09:08:00] <XioNoX>	 godog: ^
[09:09:02] <slyngs>	 Yes, that one fails across a number of hosts
[09:10:41] <volans>	 FYI I'll be playing with netbox-next so if anything happens there you know whose fault it is ;)
[09:12:48] <godog>	 XioNoX: will take a look shortly
[09:13:15] <XioNoX>	 thx!
[09:13:29] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) rq-idm.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:19] <slyngs>	 godog: The service is called:  wmf_auto_restart_apache-htcacheclean but it says  wmf_auto_restart_apache2-htcacheclean in alertmanager
[09:26:53] <slyngs>	 Okay, so it was rolled out with the wrong name, and the incorrect service wasn't removed
[09:28:03] <godog>	 slyngs: cheers, ok so things were "wrong" in puppet due to left over state but otherwise working as intended (?)
[09:28:35] <slyngs>	 If you do a git log on modules/profile/manifests/idm.pp
[09:28:45] <slyngs>	 You can see that it's being renamed
[09:29:07] <slyngs>	 I'm just testing out on my test server if prometheus will pick up if it's being disabled
[09:29:40] <godog>	 ah ok got it! thank you for the context
[09:30:20] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Just noticed what has been probably in the radar for @cmooney for some time now: [[https://ne...
[09:30:23] <godog>	 I doubt it FWIW, most of the time removing resources from puppet means they are left as is on the filesystem
[09:30:37] <slyngs>	 I'm currently trying a systemctl disable and reset-failed on idm-test1001, let's see if that clears the alert in alertmanager
[09:31:25] <godog>	 pretty sure it will yeah, for the auto-restarts there are also timers to remove/disable FWIW
[09:32:18] <slyngs>	 Oh, right
[09:33:06] <slyngs>	 That fixed the issue. I'll just clean up the remaining three hosts
[09:33:13] <slyngs>	 https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed&q=name%3Dwmf_auto_restart_apache2-htcacheclean.service
[09:34:05] <godog>	 sweet, thank you
[09:35:45] <jbond>	 slyngs: re wmf_auto_restart_apache2-htcacheclean.service from what i can tell this is an old unit left over from a badly named define.  the real unit should be wmf_auto_restart_apache-htcacheclean.service.  i cleaned up the old files manually on idm[12] 
[09:36:10] <slyngs>	 exactly
[09:36:40] <slyngs>	 It got rolled out to arclamp1001 and arclamp2001 as well. I'll just remove them there as well
[09:36:42] <jbond>	 however it's unclear to me if you actually need apache-htcacheclean. afaik it's disabled by default and i don't see anything managing that service, so if you don't need it i'd be tempted to also remove wmf_auto_restart_apache-htcacheclean
[09:36:50] <jbond>	 ack
[09:37:20] <jbond>	 fyi to remove i think you can systemctl stop && systemctl disable && systemctl reset-failed
[09:37:35] <slyngs>	 Yep, that's what I've done :-)
[09:37:58] <jbond>	 cool :)
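The cleanup discussed above (stop, disable, reset-failed, plus the matching timer godog mentioned) could be sketched roughly as follows. This is a minimal illustration, not what was actually run on the hosts: the helper names `cleanup_commands` and `run` are hypothetical, and the timer name in the usage note is an assumption based on the unit name.

```python
import subprocess


def cleanup_commands(unit, timer=None):
    """Build the systemctl invocations for retiring a leftover unit.

    The timer (if any) is handled first so it cannot re-trigger the
    service while the service itself is being cleaned up.
    """
    targets = ([timer] if timer else []) + [unit]
    commands = []
    for target in targets:
        for action in ("stop", "disable", "reset-failed"):
            commands.append(["systemctl", action, target])
    return commands


def run(commands, dry_run=True):
    """Print the commands (dry run) or execute them in order.

    check=True mirrors the `&&` chaining from the chat: the first
    failing command aborts the sequence.
    """
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

For example, `run(cleanup_commands("wmf_auto_restart_apache2-htcacheclean.service", "wmf_auto_restart_apache2-htcacheclean.timer"))` only prints the six commands; passing `dry_run=False` would execute them and needs root on the target host.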
[09:38:36] <slyngs>	 I think maybe certain deb packages with Apache modules will enable the htcacheclean service and the auto_restart was added to ensure that the service would restart on package updates
[09:40:18] <jbond>	 slyngs: i think my point is either manage both the service apache-htcacheclean and the auto restart or neither
[09:40:55] <jbond>	 right now you manage the auto restart but not the service, which means you can get the machine in a strange state e.g. on reboot 
[09:40:58] <jbond>	 loaded (/lib/systemd/system/apache-htcacheclean.service; disabled; vendor preset: enabled)
[09:41:05] <jbond>	 from idm-test notice disabled
[09:41:48] <slyngs>	 Okay, yes that makes sense, to keep them in sync
[09:43:43] <jbond>	 slyngs: it may also make sense to add this as a flag to the http class e.g. enable_htcacheclean instead of directly in the idm class
[09:44:47] <slyngs>	 That would be better, it's used in a few other places as well
[09:47:53] <slyngs>	 I'll just try my hand on a patch and send it your way
[09:48:00] <jbond>	 yes i did a quick check and it seems it's running on 13/485 servers with the httpd class and i only see the service being managed in one place (modules/profile/manifests/opensearch/api/httpd_proxy.pp), however that doesn't have the auto-restart so i think a couple of other things will definitely benefit
[09:48:06] <jbond>	 sure thing thanks
[11:08:48] <volans>	 so... we have a small~ish problem
[11:09:17] <volans>	 the latest wheels for netbox generate a 127M artifacts/artifacts.bullseye.tar.gz
[11:09:21] <volans>	 gerrit has:
[11:09:24] <volans>	 [receive] maxObjectSizeLimit = 100m
[11:10:14] <volans>	 was 80M last time, I can check why it is so big, but anyway we're close to the limit and need to re-think it
[11:13:26] <volans>	 ok weird, this might be a bug, we have duplicated wheels in the last generation, looking into it
[11:13:32] <volans>	 that said 80 is close to 100 :D
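A rough way to check an artifacts tarball for duplicated wheels and compare its size against the Gerrit limit might look like the sketch below. It assumes that "duplicated" means more than one wheel for the same distribution name and that `100m` means 100 MiB; the function names are hypothetical and this is not the tooling used for the netbox deploy repo.

```python
import re
import tarfile
from collections import defaultdict

# Assumption: Gerrit's "maxObjectSizeLimit = 100m" is read as 100 MiB.
GERRIT_LIMIT_BYTES = 100 * 1024 * 1024


def wheel_distribution(filename):
    """Return the normalized distribution name of a wheel, else None."""
    base = filename.rsplit("/", 1)[-1]
    if not base.endswith(".whl"):
        return None
    # Wheel names start with "<distribution>-<version>-..."; the
    # distribution part never contains "-" (it is escaped to "_").
    name = base.split("-", 1)[0]
    return re.sub(r"[-_.]+", "-", name).lower()


def duplicated_wheels(filenames):
    """Map distribution name -> wheel files, only where more than one exists."""
    groups = defaultdict(list)
    for filename in filenames:
        dist = wheel_distribution(filename)
        if dist is not None:
            groups[dist].append(filename)
    return {d: sorted(fs) for d, fs in groups.items() if len(fs) > 1}


def wheels_in_tarball(path):
    """List member names of a gzip-compressed tarball."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


def fits_limit(size_bytes, limit=GERRIT_LIMIT_BYTES):
    """True if an object of this size would be accepted under the limit."""
    return size_bytes <= limit
```

Something like `duplicated_wheels(wheels_in_tarball("artifacts/artifacts.bullseye.tar.gz"))` would then list any distribution that ships more than one wheel, and `fits_limit(os.path.getsize(...))` flags a tarball that Gerrit would reject.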
[11:16:06] <wikibugs>	 CAS-SSO, netbox, Infrastructure-Foundations: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 (SLyngshede-WMF)
[11:18:31] <wikibugs>	 CAS-SSO, netbox, Infrastructure-Foundations: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002 (SLyngshede-WMF) I currently have a patch with the python-social-auth project to enable OIDC via CAS in the django-social-auth plugin. It still needs documentation...
[11:49:00] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) p:Triage→Low
[11:49:44] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[11:49:52] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[12:41:05] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) a:ayounsi
[12:46:27] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) Removed from Netbox, last step is the above Puppet change ready for reviews.
[12:48:10] <wikibugs>	 SRE-tools, netops, Infrastructure-Foundations, SRE, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi)
[12:49:01] <wikibugs>	 SRE-tools, netops, Infrastructure-Foundations, SRE, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi) Open→Resolved Closing this task as the short term goals are done, medium terms have their own task.
[12:59:52] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Enabled it on pfw3-codfw, and removed the exception on fasw-c-codfw and it's working as expected: ` pfw3-codfw# run show lldp neighbors     Local Interface    Parent Int...
[13:00:04] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) a:ayounsi
[13:27:56] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @aborrero cloudcontrol2004-dev is in a public VLAN that is what we didn't relocate it in B1. Bu...
[13:39:32] <sukhe>	 hi folks
[13:39:39] <sukhe>	 I am not sure there is a pattern here but I thought I'd share in case there is
[13:39:54] <sukhe>	 I have been noticing one-off Puppet failures over the last two days or so, example: https://puppetboard.wikimedia.org/report/lvs5005.eqsin.wmnet/01991cda8cac29625a8ef11c89e56a444cd2ea62
[13:40:08] <sukhe>	 do we know if some network issues are at play?
[13:41:17] <sukhe>	 just sharing in case someone else noticed it as well
[13:41:26] <sukhe>	 running agent fixed the issue fwiw
[13:43:05] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) FYI, it's still needed to disable LLDP on switch interfaces facing the management routers.
[13:56:29] <jbond>	 sukhe: we are in a meeting at the moment, i'll take a look after we finish. one thing to say (before looking) is that yesterday afternoon after the switch upgrade puppet was unhappy for most of the afternoon, which would result in a bunch of random failures
[13:56:46] <sukhe>	 jbond: ok thank you, not urgent for sure!
[13:56:56] <sukhe>	 and if it helps, this one is from today (~30 mins ago I think)
[13:57:08] <sukhe>	 but yeah like I said, another agent run cleared it up
[13:57:27] <jbond>	 ack  then not related to yesterday :) ill take a look in a bit 
[13:57:40] <sukhe>	 <3
[13:57:51] <sukhe>	 just posting for awareness mostly
[13:57:58] <sukhe>	 no pending issues on my end
[14:13:29] <jinxer-wm>	 (SystemdUnitFailed) firing: httpbb_kubernetes_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:28] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) Open→Resolved For the record, Netbox changes {F36932140}
[14:17:39] <wikibugs>	 netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (aborrero)
[14:23:31] <wikibugs>	 Puppet, Infrastructure-Foundations, User-jbond: Proposal: create a framework to build containerized incident management  protects - https://phabricator.wikimedia.org/T265153 (jbond)
[14:42:48] <wikibugs>	 Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond) FYI i have created a new implementation which produces the following data  {P45977}  > And this change is a good opportunity (while being...
[14:42:52] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[14:43:27] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Upgrade doc updated: https://wikitech.wikimedia.org/w/index.php?title=Juniper_router_upgrade&diff=2064827&oldid=2016903 Receiver i...
[14:49:59] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @Papaul we're gonna reimage this one onto new vlans (will happen to all the public vlan ones i...
[14:54:35] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:03:33] <godog>	 hi, I'm getting a CI failure on dns.git (master is failing too), I'm wondering if you could help debugging: https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/4533/console
[15:05:35] <XioNoX>	 godog: that's my bad
[15:05:43] <volans>	 godog: this usually is because of removed prefixes in netbox
[15:05:48] <XioNoX>	 yeah
[15:05:50] <volans>	 for which the related file included in master is gone
[15:05:59] <wikibugs>	 netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) We had a conversation about this today. Conclusions:  * we will migrate the remaining of cloudvirts to single NIC, so the se...
[15:06:11] <godog>	 hah, thank you that explains
[15:06:22] <volans>	  delete mode 100644 2.2.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[15:06:22] <volans>	  delete mode 100644 21.192.10.in-addr.arpa
[15:06:22] <godog>	 anything I can do/help with to fix it ?
[15:06:38] <XioNoX>	 sending a patch
[15:06:58] <godog>	 ack thanks
[15:09:07] <volans>	 and to be clear, the changes in the exported dns repo from netbox should not have been merged without asking the people involved in that change
[15:09:17] <volans>	 as that was merged as a spurious change with other "expected" changes
[15:09:49] <XioNoX>	 https://gerrit.wikimedia.org/r/c/operations/dns/+/904198 volans 
[15:10:47] <volans>	 +1ed
[15:10:48] <volans>	 thx
[15:11:47] <godog>	 sweet, thanks for the quick fix
[15:12:27] <XioNoX>	 godog: done! let me know if it's good now
[15:13:27] <godog>	 yeah I think we're good, cheers XioNoX 
[15:13:29] <jinxer-wm>	 (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:14] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Open→Resolved a:ayounsi
[15:14:55] <godog>	 yeah all good, locally I get FileNotFoundError: [Errno 2] No such file or directory: '/usr/sbin/gdnsd'
[15:15:02] <godog>	 hence "no unexpected errors"
[15:15:35] <volans>	 godog: follow the instructions for running locally without gdnsd
[15:16:02] <godog>	 thank you, jenkins passed and I'm happy as is
[15:16:10] <volans>	 tox -- -n
[15:28:10] <jobo>	 Hey I/F team, could you please complete the feedback form for Sprint week https://docs.google.com/forms/d/e/1FAIpQLScRESJOLI_6N5REhzkgtOYRFOZIFhvfiLmJErrjQqmTCczENg/viewform?usp=sf_link 
[15:31:50] <volans>	 "You've already responded" :-P
[15:32:24] <XioNoX>	 I sent an email instead
[15:52:15] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Open→Resolved LLDP is now enabled on all the SRXs.  > FYI, it's still needed to disable LLDP on switch interfaces facing the management routers. To expand on thi...
[18:29:04] <wikibugs>	 SRE-tools, Infrastructure-Foundations, PyBal, SRE, serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (BCornwall)
[18:54:18] <jobo>	 XioNoX: i know, but Leo would like to have everyone’s responses in the form as there are slightly different questions.
[18:58:39] * lmata would appreciate the response in survey <3
[19:03:22] <lmata>	 thanks!