[06:27:14] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (hashar)
[07:10:41] Traffic, SRE: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (elukey) >>! In T334078#8759196, @Ottomata wrote: > From a brief glance, those look like normal consumer reassignment messages. Probably shouldn't be alerts. @Ottomata I thought so yes, but I got a...
[07:16:55] Traffic, Infrastructure-Foundations, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (ayounsi) p:Triage→Low
[07:39:56] Traffic, netops, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (ayounsi) Thanks for the feedback! > Weighing this against the costs of maintaining them properly, that's the big question here. Indeed :) I opened...
[07:54:00] netops, Infrastructure-Foundations, SRE: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Resolved→Open Re-opening as there are some EVPN elements outside the 'protocols bgp' context that also need to be added. Will submit patch.
[08:08:27] Traffic, netops, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (cmooney) That codfw error is interesting actually, it makes me wonder why we have the "no-resolve" command on those routes? Without that the error wo...
[09:32:01] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Volans) FYI there is already a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/restart-pybal....
[11:04:40] hello!
I have a service running in k8s that I'd like to move to lvs_setup - is today an okay time for this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/899607/
[11:05:01] I saw reference to a cookbook for doing this recently, is the manual process still the best way?
[11:06:28] hnowlan: I'm not authoritative on this, but see also https://phabricator.wikimedia.org/T334166#8761905
[11:06:46] volans: ahh I see, ty
[11:06:47] so I'm not sure on the current status
[12:49:06] hnowlan: I can help :)
[12:49:26] You can use the cookbook, but you need to give it the actual servers to run on, the aliases are broken
[12:49:56] Assuming the primaries and secondaries for low_traffic are still the same
[12:50:01] 544 sudo cookbook sre.loadbalancer.restart-pybal --query 'P{lvs1020*,lvs2010*}' --reason "Adding mw-api-int service - restarting secondary LVS" --task-id T333120
[12:50:03] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120
[12:50:08] 546 sudo cookbook sre.loadbalancer.restart-pybal --query 'P{lvs1019*,lvs2009*}' --reason "Adding mw-api-int service - restarting primary LVS" --task-id T333120
[12:53:33] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Clement_Goubert) FWIW, the cookbook can be used, but it needs to be given the actual lvs servers to run on. Assuming `lvs1020` and `lvs2010` are secondaries, `lvs1...
[13:04:24] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Volans) Thanks for the clarification @Clement_Goubert
[13:05:44] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 (Clement_Goubert)
[13:08:53] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[13:10:02] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[13:35:32] hnowlan: there should be no concern from Traffic's side I think for doing it today
[13:35:48] claime: out of curiosity, which alias fails?
[13:36:03] sukhe: the primary/secondary high/low ones
[13:36:13] see the link of the reverted change in the task
[13:36:14] ^this
[13:36:44] ah thanks
[13:37:25] yep, this was reverted when we were adding new LVS hosts and the transitory stage was causing issue
[13:37:28] s
[13:37:31] yeah having to manually specify them brings the risk of human error back
[13:38:09] but I think we can at least revert the aliases, if nothing else, specify the hosts manually for now vs people doing it all the time
[13:38:13] will fix and submit a patch for review
[13:38:28] the aliases were depending on some of the other changes
[13:38:41] yep, the current Puppet code at that time
[13:38:45] which we reverted
[13:38:49] we being Traffic
[13:53:48] sukhe: thanks!
[13:54:00] claime: sweet
[14:12:44] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[14:21:59] proceeding with the change now fyi
[14:25:45] just to confirm, for low-traffic: secondaries are still lvs1020 and lvs2010, primaries are still lvs1019, lvs2009 right?
[14:27:38] yes
[14:27:40] you can verify from
[14:27:43] modules/profile/manifests/lvs/configuration.pp
[14:27:45] but you are right
[14:29:01] great, just wanted to be 100% sure :D
[14:29:04] np!
[14:29:08] feel free to post here
[14:29:10] proceeding with secondaries now if that's okay
[14:31:20] sure
[14:32:42] (SystemdUnitFailed) firing: (5) varnishkafka-eventlogging.service Failed on cp3064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:31] done! safe to proceed with primaries?
[14:35:01] looking
[14:35:21] hnowlan: go ahead
[14:37:42] (SystemdUnitFailed) resolved: (5) varnishkafka-eventlogging.service Failed on cp3064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:37:59] thanks! going
[14:40:26] looks okay. Thanks for the help!
[14:40:41] np! thanks!
[14:57:31] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye executed with errors: - lvs6003 (**FAIL**) - Downtimed on...
[14:57:44] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[15:02:55] Traffic, SRE: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (ssingh) Resolved→Open ` Apr 06 14:27:14 cp3064 varnishkafka[1513247]: Condition(c->offset <= c->vtx->len) not true. Apr 06 14:27:14 cp3064 systemd[1]: varnishkafka...
[15:05:48] netops, Infrastructure-Foundations: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (cmooney) p:Triage→Medium
[15:08:37] netops, Infrastructure-Foundations: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney)
[15:09:07] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) >>! In T292095#8715082, @Jclark-ctr wrote: > @cmooney Racks e5-7 f5-7 have been cabled and racked do you want to use same ticket f...
[15:09:19] netops, Infrastructure-Foundations: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney) p:Triage→Low
[15:41:12] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney)
[15:42:57] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**) - Downtimed on Icinga/Aler...
[15:53:59] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[16:34:56] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**) - Downtimed on Icinga/Aler...
[17:32:34] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7d20917-1f70-4c85-bea4-4fae89694441) set by cmooney@cumin1001 f...
[17:33:03] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=09fdc8d3-92d3-4c3b-8e46-8c1befa6a846) set by cmooney@cumin1001 f...
[17:36:30] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye
[18:07:29] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[18:19:03] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye completed: - lvs3007 (**PASS**) - Downtimed on Icinga/Aler...
[18:20:49] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[20:03:01] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:05:33] Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall)
[21:05:52] Traffic: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall) Open→In progress p:Triage→Medium