[00:49:02] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10644831 (BCornwall)
[08:15:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[08:25:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[08:59:33] Traffic, Liberica: Alert on deployed config not being used - https://phabricator.wikimedia.org/T389175 (Vgutierrez) NEW
[08:59:41] Traffic, Liberica: Alert on deployed config not being used - https://phabricator.wikimedia.org/T389175#10645521 (Vgutierrez) p:Triage→Medium
[11:56:27] Traffic, Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196 (Fabfur) NEW
[12:05:35] netops, Infrastructure-Foundations, ops-magru, Patch-For-Review: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10646249 (cmooney) Open→Resolved Router stable and config added to automation templates, closing task.
[12:33:48] Traffic, Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10646328 (aborrero) I see the project has: ` 9 / 12 instances. 12 / 12 VCPUs. 23.0 GB / 24.0 GB RAM. 5 / 80 GB volume space. ` (per https://openstack-browser.toolforge.org/project/t...
[12:34:09] Traffic, Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10646329 (aborrero) p:Triage→Medium
[12:34:23] Traffic, Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10646332 (aborrero) a:Raymond_Ndibe
[13:22:53] Traffic, Liberica: liberica as a replacement for bird on dns boxes - https://phabricator.wikimedia.org/T389201 (Vgutierrez) NEW
[13:25:33] Traffic, Liberica: liberica as a replacement for bird on dns boxes - https://phabricator.wikimedia.org/T389201#10646493 (Vgutierrez) p:Triage→Medium
[13:29:36] hey vgutierrez ! Just wanted to touch base on T387569 . We're preparing for the Opensearch migration, so probably won't get to this until Friday at the earliest
[13:29:36] T387569: Update Elastic puppet code to filter LVS config based on cluster membership - https://phabricator.wikimedia.org/T387569
[13:50:34] Traffic, Liberica: Support UDP services - https://phabricator.wikimedia.org/T389210 (Vgutierrez) NEW
[13:50:37] Traffic, Liberica: Support UDP services - https://phabricator.wikimedia.org/T389210#10646673 (Vgutierrez) p:Triage→Medium
[13:54:47] Traffic, Liberica: Provide DNS healthchecks - https://phabricator.wikimedia.org/T389211 (Vgutierrez) NEW
[13:56:05] Traffic, Liberica: Provide NTP healthchecks - https://phabricator.wikimedia.org/T389212 (Vgutierrez) NEW
[14:03:42] hey folks! After the switchover completes (+ safe time), would it be ok to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128343 and restart pybals for low-traffic in eqiad/codfw?
[14:05:11] elukey: works for us :)
[14:05:29] <3
[14:05:36] I'll ping you later on then
[14:05:40] please do
[14:13:23] Traffic, Liberica: Withdraw associated BGP prefix if realservers aren't healthy - https://phabricator.wikimedia.org/T389216 (Vgutierrez) NEW
[14:14:04] Traffic, Liberica: liberica as a replacement for bird on dns boxes - https://phabricator.wikimedia.org/T389201#10646849 (Vgutierrez)
[14:52:40] Traffic, Cloud-VPS (Quota-requests): Increase quota for Traffic cloud project - https://phabricator.wikimedia.org/T389196#10647003 (Fabfur) @aborrero thanks for this, yeah no need to double volume space, sorry!
[15:51:38] sukhe: o/
[15:51:50] do you have 5 mins for the pybal restarts?
[15:51:58] I'd follow https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers basically
[15:52:10] meeting at the moment elukey
[15:52:18] ah snap okok!
[15:52:22] elukey: meeting but will ping when done
[15:59:10] elukey: let's do it?
[16:00:49] sure!
[16:01:06] I am going to start with disabling puppet
[16:01:19] ok
[16:02:08] all right, done
[16:02:12] proceeding with merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128343
[16:05:02] rolling out now to the puppetservers
[16:06:06] ok
[16:06:58] sukhe: need to revert, something horrible happened that I should have imagined
[16:07:11] conftool removed the wikikube workers that were pooled
[16:07:14] elukey: uh?
[16:07:21] so now there is no backend behind the lvs
[16:07:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128903
[16:07:38] in fact https://maps.wikimedia.org/ is not responding
[16:07:39] sigh
[16:07:41] that is not what you desired?
I thought you wanted that
[16:07:42] ok
[16:07:55] should be a quick add, just merge it
[16:08:18] yep doing now
[16:08:29] what I wanted was moving smoothly to the kubesvc
[16:08:36] but without dropping live traffic
[16:08:37] my bad
[16:08:52] so then in that case it needs to be broken up further
[16:08:59] I guess you need to do it in two steps? Switch the lvs backend, wait for traffic to drain, then remove the leftovers?
[16:08:59] yep
[16:09:11] yeah
[16:10:57] ok back in business
[16:10:59] sigh
[16:11:35] that's ok <3
[16:11:36] sending two patches
[16:11:42] sorry I misunderstood your intention as well
[16:13:35] Traffic, Patch-For-Review: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147#10647473 (Fabfur)
[16:15:26] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10647485 (RobH) > After changing the router side of the qsfp and fiber port back to solid green. > > Can you test it on your side? > > For information, you no longer ha...
[16:19:59] sukhe: all right trying for the second time :D
[16:20:06] elukey: you can ease a bit of pain by merging on a backup first then picking _a_ low-traffic in either eqiad or codfw, instead of both (given it is active_active anyway)
[16:20:19] gl!
[16:20:24] yes yes
[16:20:34] that was the plan all along to avoid outages
[16:20:40] but I managed to create one anyway
[16:20:54] it never happened
[16:21:00] the outage
[16:21:04] we don't talk about it :P
[16:21:15] more seriously, stop being so hard on yourself!
[16:21:46] nah it is fine, I am a bit sad since I realized the problem only after pressing "yes" in puppet-merge
[16:21:59] it was like some synapses in my brain fired
[16:22:29] elukey: https://media.tenor.com/x9LOF4HF7_AAAAAC/ive-made-a-huge-mistake-mistake.gif
[16:22:37] let's put it in the error budget of the SLO :P
[16:22:40] :D
[16:24:15] so codfw has less traffic (for maps I mean), I'll start from lvs2014.codfw.wmnet that is the low traffic secondary
[16:24:41] yes (secondary for everything but yes). then do 2013
[16:26:02] ok so I guess that the following is unrelated
[16:26:03] Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null
[16:26:27] it happens right after the catalog is applied
[16:28:22] That's been an issue since friday, I believe
[16:28:22] yes I see it also on lvs4009
[16:28:31] okok brett thanks for confirming :)
[16:28:39] yep unrelated for sure
[16:28:41] restarting pybal on 2014
[16:29:17] and checking ipvsadm
[16:30:37] looks good!
[16:30:44] proceeding with lvs2013 then, ok sukhe ?
[16:30:52] checking
[16:31:09] I see the wikikube backend ips for 10.2.1.13:6543
[16:31:17] Traffic: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147#10647619 (Fabfur)
[16:31:18] that is the expected result
[16:32:13] looks good
[16:32:19] elukey: T388629 FYI
[16:32:19] T388629: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629
[16:34:42] and 2013 done, going to wait a bit for the new traffic datapoints and I'll move to eqiad
[16:37:26] for eqiad, I guess that I should only touch lvs1020 (secondary) and lvs1019 (active), since 1013 is liberica
[16:37:36] is my understanding right?
[16:37:42] 1013 is experimental host
[16:37:44] yeah
[16:37:55] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10647639 (cmooney) As discussed - somewhat clutching at straws at this point - we're gonna try moving the link/optic from port 48 to port 49 on the switch side. I've recon...
[16:37:55] so 1020 and then 1019
[16:38:08] yeppa
[16:39:15] Traffic: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147#10647648 (Fabfur)
[16:40:16] all good on 1020, doing 1019
[16:40:23] nice!
[16:42:09] FIRING: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:42:47] and 1019 done
[16:43:44] now I am going to file the cleanup patch
[16:43:48] ok!
[16:43:53] hmmm that's probably a side effect of the switchover
[16:43:56] (the alert)
[16:44:20] vgutierrez: well can or cannot be though. we have been seeing it on 2013 and even intermittently 1019
[16:44:27] but yeah, timing is suspect
[16:44:42] sukhe: sure..
both low-traffic LVS in eqiad & codfw
[16:44:47] yeah
[16:44:53] we need to tune that alert anyways
[16:45:48] final one https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128910
[16:45:51] I mean final for today :P
[16:47:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:48:09] FIRING: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:48:32] elukey: so I can't migrate kartotherian to IPIP anymore ;P (
[16:48:52] vgutierrez: :D
[16:48:58] sukhe: done, thanks a lot! <3
[16:49:03] till somebody figures out IPIP inbound traffic on k8s
[16:49:05] <3
[16:49:36] vgutierrez: IPIP tattoo when?
[16:50:58] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10647692 (cmooney) It's been moved to port 49 now, but switch is still reporting no TX light on the second lane: ` Mar 18 16:42:37 asw1-b12-drmrs fpc0 qsfp-0/0/49 plugged...
[16:51:08] I'll try to start the cleanup of old kartotherian lvs configs tomorrow https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128344
[16:51:53] cool
[16:52:24] RESOLVED: LVSHighRX: Excessive RX traffic on lvs1019:9100 (eno1np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[17:15:54] Traffic: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147#10647830 (Vgutierrez) Open→Resolved
[18:08:54] Hey folks!
I'm seeing an hourly auth failure for the openstack user 'traffic-cloud-dns-manager' -- that probably means your acme-chief setup is failing as users with that name are usually involved in the call-and-response for certs.
[18:08:59] Can someone have a look?
[18:09:14] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10648248 (RobH) Ticket updated to move the link to router port 3.
[18:12:21] brett ^^ could you take a look / open a task?
[18:14:02] andrewbogott: Sorry to hear that! Will do
[18:14:22] thanks!
[18:16:47] Acme-chief, Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241 (BCornwall) NEW
[18:17:04] Acme-chief, Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241#10648289 (BCornwall) p:Triage→Medium
[18:21:44] thx
[18:37:04] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10648385 (cmooney)
[18:45:36] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10648398 (BCornwall)
[18:48:36] Traffic, Liberica: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10648403 (cmooney) @Vgutierrez I hit on a small discrepancy in Netbox, I think we just need to clean it up but wanted to check. This port on asw1-b13-drmrs had the cable on port et-0/0/17 removed,...
[20:11:58] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10648672 (BCornwall)
[20:12:12] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10648674 (BCornwall) Open→In progress
[20:59:46] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10648834 (BCornwall)
[21:07:27] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10648899 (BCornwall)
[22:02:40] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10649325 (BCornwall)
[22:38:35] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10649405 (BCornwall)
[23:01:19] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10649467 (BCornwall)