[05:29:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (ayounsi) I also like option 5 (hard-coding the conditional in Jinja to not configure RA if the device name starts...
[06:59:33] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (MoritzMuehlenhoff) >>! In T330884#8865295, @ayounsi wrote: >> I copied over samplicator from bullseye-wikimedia to bookworm-wikimedia (the only dependency is glibc itself)...
[08:19:48] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (MoritzMuehlenhoff)
[08:20:28] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Retire sre.aqs.roll-restart cookbook - https://phabricator.wikimedia.org/T330889 (MoritzMuehlenhoff) Open→Resolved The old cookbook has been removed and the docs were updated.
[08:47:42] (SystemdUnitFailed) firing: cadvisor.service Failed on ganeti6003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:42] (SystemdUnitFailed) resolved: cadvisor.service Failed on ganeti6003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:51] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (MoritzMuehlenhoff) I've updated https://wikitech.wikimedia.org/wiki/Ganeti to point to the new cookbook
[09:42:47] jbond: welcome back!
[09:43:20] I'd nothing important for you, I was a little curious about something on the install servers though
[09:43:38] Working on install1004 last week, I couldn't see in the iptables rules how we were allowing DHCP from our mgmt network
[09:43:56] Network there is 10.65.0.0/16, which doesn't seem to be included in the rules
[09:44:22] but DHCP works regularly for devices on this so I'm clearly missing something
[09:45:24] jbond: actually nevermind, typing this out made me realise
[09:45:46] well, making an assumption I'm figuring the management router doesn't use the source interface IP to relay the DHCP requests
[09:50:25] topranks: not checked but yes thats probably the reason
[09:51:05] yeah it came up while _tftp_ from that range was blocked (as part of the ztp testing), and I noticed no rules allowing that range
[09:51:27] but thinking about dhcp specifically it's not needed for that I expect
[09:53:29] topranks: no the mgmt host just need an ip address so never fetch anything via tftp
[09:54:23] yep cool. we're gonna try to use http from apt server for the ztp too if we can
[09:54:30] so hopefully won't need to adjust those rules
[09:55:23] ack sgtm
[12:55:32] SRE-tools, Discovery-Search, Infrastructure-Foundations, SRE, Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (Gehel)
[13:19:42] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (cmooney) >>! In T337057#8869828, @ayounsi wrote: > I also like option 5 (hard-coding the conditional in Jinja to n...
[13:47:12] Puppet, Horizon: Allow providing a commit message for hieradata changes - https://phabricator.wikimedia.org/T250623 (joanna_borun)
[13:47:19] Puppet, Wikimedia Meet: Puppetize the jitsi instance - https://phabricator.wikimedia.org/T251040 (joanna_borun)
[13:47:40] Puppet, Horizon: Preserve formatting etc. in horizon hiera editor - https://phabricator.wikimedia.org/T250622 (joanna_borun)
[13:47:51] Puppet, Beta-Cluster-Infrastructure: puppetmaster config in deployment-prep may be inadvertently breaking store,logstash reports? - https://phabricator.wikimedia.org/T218175 (joanna_borun)
[13:48:09] Puppet, Cloud-VPS, MediaWiki-Vagrant: Vagrant -> mwvagrant alias in role::labs::mediawiki_vagrant is brittle - https://phabricator.wikimedia.org/T195592 (joanna_borun)
[13:49:42] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:19] Puppet, Beta-Cluster-Infrastructure, Product-Infrastructure-Team-Backlog-Deprecated, VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deploy... - https://phabricator.wikimedia.org/T259812
[13:51:37] Puppet, Cloud-VPS, Data-Persistence, Thumbor, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (joanna_borun)
[13:52:04] Puppet, Cloud-VPS, cloud-services-team: Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (joanna_borun)
[14:53:35] XioNoX: whoa, gNMI looks neat
[14:56:22] indeed, but complex
[14:56:41] and doesn't have some nice to have features (like commit/rollback)
[15:25:53] Puppet, Beta-Cluster-Infrastructure, Product-Infrastructure-Team-Backlog-Deprecated, VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deploy... - https://phabricator.wikimedia.org/T259812
[15:37:56] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi)
[15:38:59] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi) p:Triage→Low
[16:02:08] XioNoX: is the lack of commit/rollback inherent to gNMI? I thought it was just a platform limitation with Sonic?
[16:02:52] only netconf supports it so far (thanks to Juniper)
[16:03:18] gNMI could support it but it's not there yet, same for restconf
[16:03:42] ok good to know
[16:04:26] I guess maybe in a more automated world it's not as important as one time, you can revert the git changes and re-apply to go back
[16:04:43] "commit confirmed" on operations is quite important still though, lest you break your own comms
[16:04:58] yeah agreed
[16:08:04] on the other hand for the cookbook actions (like update interfaces) we could get rid of the automatic commit confirmed/check to save some time and resources
[16:09:08] Until someone deletes em0 IP in Netbox or something :P
[16:09:24] but tbh we could always probably think up edge cases to break things, it's very unlikely
[16:10:04] yeah, that cookbook has safeguards, so it's unlikely and has been tested since it was introduced
[16:11:59] hi all im wondering if IF is the correct owner for the keyholder phab tag (https://phabricator.wikimedia.org/tag/keyholder/) AFAIK its currently unowned (cc moritzm jobo)
[16:34:36] netops, Commons, Infrastructure-Foundations, Traffic, WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (jbond)
[17:20:51] jbond: I think so, yes, IIRC originally keyholder was written by someone in releng with Faidon on the SRE side and most of the maintenance since then has been done by us. Plus, it's used by Cumin, Homer as well
[17:22:10] great thanks moritzm
[17:27:28] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Follow up for mx1001 incident: 2023-05-17 MXQueueHigh on mx1001 - https://phabricator.wikimedia.org/T337257 (Dzahn)
[18:52:32] netops, Infrastructure-Foundations, SRE: Junos: use mgmt_junos for syslog and ntp - https://phabricator.wikimedia.org/T320244 (ayounsi) Open→Resolved a:ayounsi All done where possible.
[18:52:42] netops, Infrastructure-Foundations, SRE: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi) All done where possible.
[18:58:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) the fasw and asw1-eqsin switches didn't create the `mgmt_junos` routing instance as they should have. https://gerrit.wikimedia.org/r/922161 works...
[18:59:45] netops, Infrastructure-Foundations, SRE: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi)
[20:21:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:44] netops, Infrastructure-Foundations, SRE: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) Open→Resolved a:ayounsi Going to close this task as this is as far as we can go due to the fasw switches not being easily upgraded.
[20:25:33] netops, Infrastructure-Foundations, SRE: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi) Open→Resolved a:ayounsi
[21:21:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed