[00:35:00] (PyBalBGPUnstable) resolved: PyBal BGP sessions on instance lvs1018 are failing - TODO - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqiad%20prometheus/ops&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[00:37:31] Traffic, SRE, ops-codfw: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (ssingh) Thanks @Jhancock.wm for the fix! I can confirm the host has been resolved. For posterity: repooling the host.
[06:20:06] Traffic, Infrastructure-Foundations, SRE: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (ayounsi) Brett, this project on which Jameel is working for his internship, is to collect latency data from users to all of our DCs. This will help improve our current [[ https://gerrit.wi...
[06:21:12] Traffic, Infrastructure-Foundations, SRE: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (ayounsi) a:JameelKaisar
[07:29:24] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[07:36:13] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[07:36:46] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) Open→Resolved a:ayounsi Thanks again everybody!
[08:47:30] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ) >>! In T333042#8736491, @Lionel_Scheepmans wrote: > Hi folks. > > I'm in front of a very strange phenomenon probably linked to this bug, and th...
[08:51:16] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ) OK, for doc taxon I can also reproduce now with that one specific link that AntiCompositeNumber found: for DC in esams eqiad codfw ulsfo eqsin...
[08:54:22] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (TheDJ)
[09:12:33] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (MatthewVernon) I think in these cases, removing the incorrect thumbnail will allow it to be recreated on next GET.
[09:16:49] Is the sre.loadbalancer.restart-pybal cookbook the canonical way to restart pybal after adding a new service in lvs_setup ?
[09:17:30] If that's the case, can I update the doc ?
[09:19:13] I imagine the query should scope down to the correct alias like lvs-low-traffic-eqiad and lvs-low-traffic-codfw?
[09:30:22] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Just noticed what has been probably in the radar for @cmooney for some time now: [[https://ne...
[09:40:13] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[09:44:17] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[09:45:32] Traffic, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (elukey)
[10:24:23] vgutierrez: I'm about to go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/904060 can you confirm running :
[10:24:28] sudo cookbook sre.loadbalancer.restart-pybal --query 'P{lvs1020*,lvs2010*}' --reason "Adding mw-api-int service" --task-id T333120
[10:24:29] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120
[10:24:30] then
[10:25:03] sudo cookbook -d sre.loadbalancer.restart-pybal --query 'P{lvs1019*,lvs2009*}' --reason "Adding mw-api-int service" --task-id T333120
[10:25:07] is the right approach
[10:25:18] * vgutierrez looking
[10:25:31] Or would you rather I use the manual systemctl restart pybal way ?
[10:26:54] claime: I think you can benefit from the aliases lvs-secondary + lvs-low-traffic rather than specific hosts
[10:27:06] vgutierrez: checking
[10:29:09] other than that, +1
[10:29:32] Hmm I may be doing query wrong but
[10:30:07] --query 'A:lvs-secondary and A:lvs-low-traffic' doesn't work, neither does using that query in alias
[10:30:54] hmmm my fault
[10:31:01] lvs-secondary-eqiad should work
[10:31:15] that or the cookbook allowed_aliases code is buggy
[10:31:24] but it'll hit all secondary lvs right? Not just low-traffic ?
[10:31:34] that's just one lvs :)
[10:32:55] We're agreed that for this to work, lvs-secondary-eqiad must be in /etc/cumin/aliases.yaml ?
[10:34:04] there is no low/high traffic alias currently on the cumin hosts...
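Note on the restart sequence quoted above: the two cookbook invocations differ only in their host query, and they are run one after the other with a pause in between (two minutes, as confirmed a little further down in the log). The sketch below only builds and prints the command strings; it does not run anything. It reuses the cookbook name and the `--query`, `--reason` and `--task-id` flags exactly as they appear in the log. The function name `restart_commands`, the constant `PAUSE_SECONDS`, and the assumption that the first host pair (lvs1020, lvs2010) is the secondary pair are illustrative, not taken from the cookbook itself.

```python
import shlex

# Assumption: pause between the two batches, per the "wait 2 minutes" in the log.
PAUSE_SECONDS = 120


def restart_commands(secondaries, primaries, reason, task_id):
    """Return the two cookbook command lines, secondaries first.

    Hypothetical helper: it only formats strings and never executes them.
    """
    commands = []
    for hosts in (secondaries, primaries):
        # Same query shape as in the log: P{host1*,host2*}
        query = "P{" + ",".join(f"{host}*" for host in hosts) + "}"
        argv = [
            "sudo", "cookbook", "sre.loadbalancer.restart-pybal",
            "--query", query,
            "--reason", reason,
            "--task-id", task_id,
        ]
        # shlex.join quotes arguments that contain shell metacharacters.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    first, second = restart_commands(
        ["lvs1020", "lvs2010"],  # assumed to be the secondaries
        ["lvs1019", "lvs2009"],  # assumed to be the primaries
        "Adding mw-api-int service",
        "T333120",
    )
    print(first)
    print(f"# wait {PAUSE_SECONDS} seconds before the next batch")
    print(second)
```

Printing the commands rather than executing them keeps the operator in the loop for each batch, which matches how the restart is handled in the conversation.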
[10:34:09] yeah
[10:34:22] there's just lvs-canary and lvs-$datacenter
[10:34:24] neither primary/secondary ones
[10:34:29] ok
[10:34:42] I was afraid I was completely missing something
[10:35:04] claime: so yeah.. your hosts are right so you can proceed
[10:35:09] :D
[10:35:12] Thanks vgutierrez <3
[10:35:16] let's see if we can add those aliases to cumin though
[10:35:30] So, first secondaries, wait 2 minutes, then primaries ?
[10:35:58] claime: yep
[10:36:05] let's go then, thanks
[10:39:11] so that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/855692
[10:39:13] sigh
[10:40:44] yes they were added and reverted because part of some other refactoring
[11:00:20] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert) `mw-api-int` and `mw-api-int-ro` services now in production, we can proceed with creating the envoy listeners in https://gerrit.wikimedia.org/r/c/operat...
[11:01:18] All done, thanks :)
[11:02:24] Should I update the https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service "Configure the load balancers" to use the cookbook ?
[11:04:50] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[11:49:01] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) p:Triage→Low
[11:49:46] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[11:49:54] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[12:41:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) a:ayounsi
[12:46:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) Removed from Netbox, last step is the above Puppet change ready for reviews.
[12:48:12] netops, Infrastructure-Foundations, SRE, SRE-tools, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi)
[12:49:03] netops, Infrastructure-Foundations, SRE, SRE-tools, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi) Open→Resolved Closing this task as the short term goals are done, medium terms have their own task.
[12:59:54] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Enabled it on pfw3-codfw, and removed the exception on fasw-c-codfw and it's working as expected: ` pfw3-codfw# run show lldp neighbors Local Interface Parent Int...
[13:00:06] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) a:ayounsi
[13:27:58] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @aborrero cloudcontrol2004-dev is in a public VLAN that is what we didn't relocate it in B1. Bu...
[13:43:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) FYI, it's still needed to disable LLDP on switch interfaces facing the management routers.
[14:05:31] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[14:15:29] netops, Infrastructure-Foundations, SRE: Is Vlan 2122 cloud-support1-b-codfw required? - https://phabricator.wikimedia.org/T327930 (ayounsi) Open→Resolved For the record, Netbox changes {F36932140}
[14:17:41] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (aborrero)
[14:42:54] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney)
[14:43:29] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Upgrade doc updated: https://wikitech.wikimedia.org/w/index.php?title=Juniper_router_upgrade&diff=2064827&oldid=2016903 Receiver i...
[14:50:01] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @Papaul we're gonna reimage this one onto new vlans (will happen to all the public vlan ones i...
[14:54:37] netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:06:01] netops, Infrastructure-Foundations, SRE, cloud-services-team, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) We had a conversation about this today. Conclusions: * we will migrate the remaining of cloudvirts to single NIC, so the se...
[15:14:16] netops, Infrastructure-Foundations, SRE: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) Open→Resolved a:ayounsi
[15:27:34] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye
[15:52:17] netops, Infrastructure-Foundations, SRE: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) Open→Resolved LLDP is now enabled on all the SRXs. > FYI, it's still needed to disable LLDP on switch interfaces facing the management routers. To expand on thi...
[16:06:33] Traffic, Infrastructure-Foundations, SRE: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (BCornwall) Thanks for that, @ayounsi! Are you aware of https://gerrit.wikimedia.org/g/operations/software/latency-measurement ? It may or may not be relevant but I wanted to make sure it w...
[16:14:48] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (jcrespo)
[16:37:49] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Aler...
[16:50:33] Traffic, SRE, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (JTannerWMF) Thanks for creating this task its a valid request. Our team can't prioritize it right now but its...
[17:40:20] Traffic, SRE, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall)
[17:42:29] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye
[18:23:38] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Aler...
[18:26:06] Traffic, PyBal, SRE: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (BCornwall)
[18:27:01] Traffic, PyBal, SRE: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (BCornwall) Stalled→In progress
[20:10:52] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye
[20:52:48] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled...