[01:30:05] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:01:21] netbox, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660226 (tstarling) PhpRedis is getting behind KeyDB with [[https://github.com/phpredis/phpredis/issues/2466|#2466]]...
[03:22:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:30:05] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:22:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:06] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:36:18] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9660652 (fnegri)
[10:14:05] CAS-SSO, Infrastructure-Foundations: Enable self-service IDP two-factor authentication management - https://phabricator.wikimedia.org/T359552#9660738 (SLyngshede-WMF) a:SLyngshede-WMF Enabling of two-factor authentication is planned for after CAS 7 upgrade.
[11:13:56] netbox, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660965 (larissagaulia)
[11:22:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:22] netbox, DC-Ops, Infrastructure-Foundations, SRE: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#9661372 (Volans) @BTullis indeed, that's another new device type created with the wrong slug. I've updated the slug in Netbox to fix it.
[13:23:28] Mail, Infrastructure-Foundations, SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9661373 (DBu-WMF) Hey @Dzahn this ticket number does not come up in search and when I add the ticket number to the url I get this message: Access Denied: Unknown Object (Task) This object is in a...
[13:30:06] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:49:50] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:54:42] netops, SRE, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Icinga BFD check failing - https://phabricator.wikimedia.org/T359198#9661583 (fgiunchedi) Open→Resolved This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-...
[15:22:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:44] sukhe@re0.cr2-codfw> show route receive-protocol bgp 10.192.32.14
[15:48:55] I swear this always used to work for me but it's not working anymore
[15:49:09] any idea why? did something change? I can't seem to find it in the manual as well
[15:55:32] sukhe: `cr2-codfw> show bgp neighbor 10.192.32.14` shows that the peer is down
[15:55:35] re0.cr2-codfw rpd[34132]: bgp_recv: read from peer 10.192.32.14 (External AS 64600) failed: Exec format error
[15:55:42] no idea what this error means
[15:55:48] maybe try to bounce pybal?
[15:56:09] yeah this is worrying :[
[15:56:34] I mean codfw is depooled but yeah
[15:57:56] sukhe: only 2 LVS in codfw?
[15:58:02] no, four
[15:58:38] (in meeting)
[16:02:02] back
[16:02:13] XioNoX: what are you seeing?
[16:02:37] note for lvs2013, https://phabricator.wikimedia.org/T348218
[16:02:44] can it be related to this?
[16:03:25] Mail, Infrastructure-Foundations, SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9662130 (Dzahn) @DBu-WMF Sorry, I tried. Then it's further restricted than just NDA-level due to Security. Please contact @Jgreen, @jhathaway or the [[ https://security.wikimedia.org/ | security team...
[16:05:48] sukhe: looks like pybal is resetting the TCP session
[16:06:27] in the pybal logs there is only Mar 26 16:05:06 lvs2013 pybal[4167108]: [bgp.BGPFactory@0x7fce2c6ff6e0] INFO: Connection received from 208.80.153.192
[16:06:39] is it possible to get more verbose logs?
[16:08:06] so lvs2014 looks fine but not 2013
[16:08:13] XioNoX: that's the extent of what I know as well
[16:08:39] sukhe: can pybal be reloaded?
[16:08:52] no, just restart
[16:08:54] but we can try that
[16:08:55] let me do it
[16:08:58] sure
[16:09:06] done
[16:09:38] try now?
[16:11:15] Mar 26 16:11:06 lvs2013 pybal[495018]: [bgp.BGPFactory@0x7f238bde96e0] INFO: Connection received from 208.80.153.193
[16:13:11] lvs2013.codfw.wmnet.bgp > cr1-codfw.wikimedia.org.61615: Flags [R.]
[16:13:19] it's still sending a reset
[16:14:17] next step try to get more verbose logs on the pybal side?
[16:15:49] I guess, I am asking bblack if he has seen this since I have not
[16:17:32] and nothing really has changed on lvs2013 since the switch migration for there to be any issue
[16:18:21] lvs2013 is in row C so I don't think it's related
[16:20:56] I meant the work in T348218
[16:20:57] T348218: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218
[16:21:59] in theory neither :)
[16:22:10] ok
[16:22:15] as it's not in the cr<->lvs path
[16:22:43] whatever issue was there has persisted for a while and it seems like it only came up because we restarted pybal yesterday
[16:22:48] can you confirm if lvs2014 is OK?
[16:23:00] I still get no output for show bgp neighbor :/
[16:24:56] sukhe: they're both down
[16:25:14] fun!
[16:25:22] been for 1d 1h
[16:25:26] (both)
[16:25:31] yeah that matches up the restart time for pybal
[16:25:32] ok
[16:25:45] why was it restarted?
[16:25:55] a service was removed yesterday, AQS
[16:30:22] XioNoX: bblack found the issue
[16:30:30] sukhe: ah?
[16:30:34] fixing it and pushing it; we had an incorrect regex in /^lvs201[3-4]$/ => "[ '208.80.153.192', '208.80.153.193' ]", # cr1-codfw,cr2-codfw
[16:30:40] so no bgp-peer-addresses
[16:31:04] alright
[16:31:17] could be worse :)
[16:31:21] thanks for the help, fixing and will check again
[16:47:21] XioNoX: all good
[16:47:41] awesome!
[16:52:09] sorry that one is on me
[16:52:29] previous edit was when I changed lvs2011
[16:53:20] topranks: no not at all
[16:53:23] we both missed it -- happens
[16:53:45] we are fixing this because in the past it has happened as well that we changed pybal.conf but didn't restart the service until later
[16:53:54] only to discover a new issue a month later :)
[16:53:56] yeah
[16:53:59] putting up a patch now to alert us about that
[16:54:04] that's the bit I should have anticipated but didn't
[16:54:16] yeah Puppet doesn't do pybal restarts so it's on us
[16:54:36] ah ok yeah
[16:54:36] and in this case since we had an incorrect regex, we should have got an alert to restart and if we did, we would have picked it up then
[16:54:48] but it's nice that codfw is depooled and we caught it today :)
[16:54:51] the conf file doesn't trigger the service reboot
[16:54:57] give thanks for the small things :)
[17:24:49] :) fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014563
[17:24:54] thanks both
[17:50:06] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:48:48] netops, Infrastructure-Foundations, ops-codfw, SRE: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9662842 (Papaul) Open→Resolved
[19:22:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:06] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[23:22:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed