[00:05:25] FIRING: SystemdUnitFailed: logrotate.service on cp3080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:06] ^ should resolve
[00:20:25] RESOLVED: SystemdUnitFailed: logrotate.service on cp3080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:23:53] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration, 13Patch-For-Review: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10650404 (10Vgutierrez) Maybe I'm misreading the task description but from >...
[06:51:55] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10650434 (10Vgutierrez)
[07:06:04] 06Traffic: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322 (10Vgutierrez) 03NEW
[07:06:07] 06Traffic: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322#10650451 (10Vgutierrez) p:05Triage→03High
[07:09:13] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10650452 (10Vgutierrez)
[07:11:50] 06Traffic, 10Liberica: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10650454 (10Vgutierrez) > Have you been in touch with dc-ops about removing this cable on site? Nope, I haven't performed any action that would lead to physical changes in any POP related to this task.
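The SystemdUnitFailed alert above tracks units in the systemd "failed" state. A minimal triage sketch: the line below is a fabricated sample of `systemctl list-units --state=failed --plain --no-legend` output matching the alert, and awk pulls out the unit name.

```shell
# Fabricated failed-unit line (format follows systemctl's columns:
# UNIT LOAD ACTIVE SUB DESCRIPTION); extract the unit name.
sample='logrotate.service loaded failed failed Rotate log files'
printf '%s\n' "$sample" | awk '{print $1}'
```

Once the underlying failure is fixed, `systemctl reset-failed logrotate.service` on the host clears the failed state, which is consistent with the alert auto-resolving a few minutes later.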
[07:12:08] topranks: ^^
[07:16:18] 10Acme-chief, 06Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241#10650457 (10Vgutierrez) it looks like traffic-acmechief01 lost the traffic-cloud-dns-manager credentials at some point: ` root@traffic-acmechief01:/etc/acme...
[07:18:13] vgutierrez: ok, well in general we need to keep an eye on this.
[07:18:41] let's open a task then
[07:18:50] The automation is not perfect when it comes to these additional links - they need to be added/removed manually. And we can’t really just mess up netbox and let it drift from what’s on site
[07:19:08] Yeah it’s no big deal here, I can re-add the link that got removed
[07:19:44] But we probably need to check if we’ve any other orphans like this
[07:20:21] I guess esams and magru are the other sites we have more than one link on the lvs also (L3 switches)
[07:21:10] yep
[07:21:13] that's likely
[07:21:27] but drmrs was the only one with several ports configured on the LVS FWIW
[07:21:52] Was it? Ok might be easier then
[07:22:07] so it looks like drmrs used one port per switch on the lvs and the others just used vlans
[07:22:17] I’m confused I was sure we had it in magru and esams
[07:22:42] yeah vlan trunk was an option between switches in the design but we didn’t go that way
[07:23:06] topranks: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128449
[07:23:30] topranks: drmrs is the only one with 2 interfaces on profile::lvs::interface_tweaks:
[07:23:40] the alternative is that esams and magru totally missed that :|
[07:24:01] is it related to https://phabricator.wikimedia.org/T367731 ?
[07:24:43] XioNoX: yep
[07:25:22] Ah that makes sense
[07:25:33] We already cleaned up magru/esams?
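The check vgutierrez describes (drmrs being the only site with two interfaces on profile::lvs::interface_tweaks) amounts to counting entries per host in hiera. A runnable sketch against a throwaway mock of a hieradata tree; the file names, interface names, and layout are made up for illustration, not the real puppet repo.

```shell
# Build a fake hieradata tree: one drmrs-style file with two
# interfaces, one esams-style file with one (all names illustrative).
tmp=$(mktemp -d)
printf 'profile::lvs::interface_tweaks:\n  enp101s0f0: {}\n  enp101s0f1: {}\n' > "$tmp/drmrs-lvs.yaml"
printf 'profile::lvs::interface_tweaks:\n  enp101s0f0: {}\n' > "$tmp/esams-lvs.yaml"

# Report files declaring more than one interface (two-space indent).
for f in "$tmp"/*.yaml; do
  n=$(grep -c '^  ' "$f")
  [ "$n" -gt 1 ] && echo "$(basename "$f"): $n interfaces"
done
rm -r "$tmp"
```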
[07:26:07] we can use that existing task to deal with drmrs too, no need for a new task
[07:30:42] 10Acme-chief, 06Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241#10650482 (10Vgutierrez) @Andrew what's the recommended way of injecting custom secrets on a puppetserver nowadays?
[07:50:53] 10Acme-chief, 06Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241#10650499 (10aborrero) related docs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Service_accounts
[08:19:57] 06Traffic, 10Liberica: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10650560 (10cmooney) >>! In T384477#10650454, @Vgutierrez wrote: >> Have you been in touch with dc-ops about removing this cable on site? > Nope, I haven't performed any action that would lead to phys...
[08:21:30] XioNoX: remote hands still haven't moved the link in cr1-drmrs from port 1 to port 3
[08:21:52] so I think that explains why the switch sees it as a single 40G again - all 4 lanes are down/equal still
[08:22:17] I still doubt moving port will do anything so likely yeah next step we'll get them to swap to another cable
[08:24:30] agreed...
[08:39:16] 10netops, 06Traffic, 06Infrastructure-Foundations: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731#10650600 (10cmooney) Doing a bit of an audit here to assess the current situation, we have the following cables in place which need to be removed: |Site|LVS|Cable|Sw...
[08:39:37] 10Acme-chief, 06Traffic: Hourly auth failures are occurring for the openstack user 'traffic-cloud-dns-manager' - https://phabricator.wikimedia.org/T389241#10650601 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez traffic-cloud-dns-manager credentials reset following the instructions available here: ht...
[08:44:20] hello folks :)
[08:44:38] I'd need to clean up old kartotherian LVS configs, starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128344
[08:44:47] lemme know if it is a good time or not
[08:45:28] I also have a question - reading from https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service it is mentioned in "Remove network probes / monitoring" that the DNS discovery record needs to be removed beforehand, otherwise it will trigger an error
[08:45:42] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10650636 (10Vgutierrez)
[08:45:45] in this case the discovery record is in use and "shared" between 3 VIP:port configs
[08:45:53] (I am removing the oldest two basically)
[08:45:59] do I need to do anything extra?
[08:46:47] I think it could be a good time now, before the DC switchover
[08:47:18] about the second question, I'd wait for someone more expert than me about this :)
[08:47:29] * vgutierrez reading
[08:52:07] elukey: nothing extra AFAIK cause the network probes target the VIP:probe and that's still up & running in your case
[08:53:02] super thanks for double checking
[08:53:29] proceeding with lvs_setup then
[08:54:57] less VIPs to move to IPIP :P
[08:55:04] elukey: not true
[08:55:10] same VIPs, different point in time
[08:55:32] unless k8s starts handling their own ingress of course
[08:56:22] hmm isn't production -> lvs_setup a NOOP in pybal terms?
[08:57:01] vgutierrez: I am removing two ports though, less cruft, oh come on you are never happy :D
[08:57:22] yep yep the spicy part is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128345
[08:57:40] I am running puppet now on dns nodes as specified in the docs
[08:58:55] elukey: so yeah.. we will have some noise regarding VIP:80 and VIP:443
[09:00:57] and your change is definitely a NOOP in the dnsboxes
[09:01:01] given the VIP is still there
[09:01:09] and DNS doesn't know anything about L4 details
[09:01:24] I am going to wait for the next patch, the MW train is going to be rolled out now
[09:01:37] ack
[09:01:52] vgutierrez: I am aware of that but I am following the docs, just to be sure to not forget anything along the way
[09:03:17] sure
[09:36:56] 06Traffic, 10DNS, 06SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333 (10MoritzMuehlenhoff) 03NEW
[09:39:05] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10650858 (10Vgutierrez)
[09:47:39] 10netops, 06Infrastructure-Foundations, 10ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10650888 (10ayounsi) Port moved and still the same issue. I asked them (in French) if the patch got properly changed, and to call me on my mobile to discuss it more in details.
[09:49:49] 06Traffic, 10Liberica: Provide DNS healthchecks - https://phabricator.wikimedia.org/T389211#10650906 (10Vgutierrez) p:05Triage→03Medium
[09:56:29] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration, 13Patch-For-Review: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10650919 (10JAllemandou) >>! In T388397#10650404, @Vgutierrez wrote: > Maybe...
[10:10:16] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration, 13Patch-For-Review: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10650959 (10JAllemandou) I answered to a comment on the gitlab PR (https://gi...
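The recursor.d migration proposed in T389333 maps onto pdns-recursor's `include-dir` setting. A hedged sketch of what the split could look like in the pre-5.x key=value format; the drop-in file name and the listen values are illustrative, not WMF's actual config:

```
# /etc/powerdns/recursor.conf -- keep only the include directive here
include-dir=/etc/powerdns/recursor.d

# /etc/powerdns/recursor.d/10-listen.conf (illustrative drop-in)
local-address=127.0.0.1
local-port=53
```

pdns-recursor loads every `*.conf` file in `include-dir` in lexical order, so numeric prefixes keep ordering predictable. The 5.x series additionally introduces a YAML configuration format, which is part of what the T381608 upgrade would have to consider.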
[10:24:56] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10651078 (10Vgutierrez)
[11:09:44] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10651238 (10Vgutierrez)
[11:11:47] 06Traffic, 13Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10651257 (10Vgutierrez) 05In progress→03Stalled ulsfo done: ` vgutierrez@cumin1002:~$ sudo cumin 'A:cp-ulsfo' 'dpkg -l | egrep "(varnish|vmod)"' 16 hosts will be targeted: cp[4037-4052].ulsfo.wmnet OK...
[11:44:13] 06Traffic: requestctl bandwidth limit has incorrect syntax - https://phabricator.wikimedia.org/T388529#10651411 (10Joe) p:05Triage→03High
[12:30:32] Hi, heads up that now we are at 25% of the steps. It takes some time to fully propagate (CDN, parser cache) but we are getting there
[13:12:54] 06Traffic, 10DNS, 06SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#10651808 (10ssingh) Thanks for filing this task! This is a good idea and we can do it under the work planned for the pdns-recursor 5.x upgrade, mentioned in T381608.
[13:12:56] 06Traffic, 10DNS, 06SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#10651810 (10ssingh)
[13:12:57] 06Traffic, 06SRE: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10651811 (10ssingh)
[13:23:42] 06Traffic, 06SRE: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10651857 (10MoritzMuehlenhoff) Just a note: Debian trixie will be released in June or July and as for past releases we'll most certainly have the base layer...
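The ulsfo verification in T378737 above runs `dpkg -l | egrep "(varnish|vmod)"` on each cp host via cumin. As a self-contained illustration of the filter, here is the same pattern applied to fabricated `dpkg -l` output (package versions and descriptions are made up):

```shell
# Fabricated dpkg -l excerpt; only the varnish/vmod lines should
# survive the filter, mirroring the cumin one-liner from the task.
sample='ii  varnish                   7.1.1-1    amd64  web accelerator
ii  libvarnishapi3            7.1.1-1    amd64  shared library for Varnish
ii  prometheus-node-exporter  1.5.0-1    amd64  machine metrics exporter'
printf '%s\n' "$sample" | grep -E '(varnish|vmod)'
```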
[13:27:13] 06Traffic, 06SRE: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10651869 (10ssingh) >>! In T381608#10651857, @MoritzMuehlenhoff wrote: > Just a note: Debian trixie will be released in June or July and as for past releases...
[14:21:15] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration, 13Patch-For-Review: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10652133 (10Ottomata) BTW, there was a request to do this for varnishkafka, b...
[14:47:47] 10netops, 06Infrastructure-Foundations, 10ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10652276 (10RobH) I saw your reply and was about to ping in IRC to thank you for discussing in French with them directly. My fear is there is a language barrier and perhaps...
[16:08:46] 06Traffic, 13Patch-For-Review: acme_chief and sslcert modules should allow destination parameter - https://phabricator.wikimedia.org/T387929#10652974 (10Fabfur) 05Open→03Resolved
[16:09:10] 06Traffic: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10652976 (10Fabfur) 05Open→03In progress p:05Triage→03Medium
[16:19:24] hey folks
[16:19:33] is it a good time to restart some pybals?
[16:19:59] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128345
[16:20:06] remove two ports for the kartotherian vip
[16:22:33] go for it
[16:29:31] 10Domains, 06Traffic: [toolforge] transfer/adopt toolsbeta.org domain to the foundation - https://phabricator.wikimedia.org/T362253#10653073 (10Andrew) a:05dcaro→03Andrew
[16:30:24] ack proceeding
[16:31:54] 06Traffic: Fix up/modernize the varnish upgrade cookbooks - https://phabricator.wikimedia.org/T389387 (10BCornwall) 03NEW
[16:33:47] running puppet on 2014 + pybal restart
[16:34:04] 10netops, 06Infrastructure-Foundations, 10ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10653129 (10RobH)
[16:34:12] 06Traffic: Fix up/modernize the varnish upgrade cookbooks - https://phabricator.wikimedia.org/T389387#10653130 (10BCornwall) 05Open→03In progress p:05Triage→03Medium
[16:38:04] sukhe: clarification for https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service - after each pybal restart I need to run ipvsadm --delete-service right?
[16:38:32] ipvsadm --delete-service --tcp-service 10.2.1.13:{6533,443} in this case
[16:38:42] yes please
[16:38:45] just double check the IP
[16:39:00] kartotherian.svc.codfw.wmnet
[16:39:06] and the equivalent for eqiad when you get there
[16:39:10] all right
[16:39:59] worked nicely
[16:40:10] nice
[16:42:36] 2013 done as well, proceeding to eqiad
[16:42:59] <3
[16:44:05] elukey: the eqiad IP should be 10.2.2.13
[16:44:28] maybe the !log has it incorrectly but yeah
[16:44:29] 12:43:26 < elukey> !log restart pybal on lvs10[19,20] and run ipvsadm --delete-service --tcp-service 10.2.1.13:{443,6533}
[16:44:29] yep yep
[16:44:36] I'll fix the log
[16:44:37] my bad
[16:45:01] fixed
[16:45:01] no worries
[16:45:45] 1020 done, doing 1019
[16:47:15] all done!
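A note on the `ipvsadm --delete-service --tcp-service 10.2.1.13:{6533,443}` one-liner quoted above: shell brace expansion puts both expanded words after a single `--tcp-service` flag, and ipvsadm accepts only one service per deletion, so the shorthand presumably stands for one invocation per port. A sketch of the expanded form, echo-prefixed so it is runnable without a live LVS host (VIP and ports are the ones from the conversation):

```shell
# Delete one IPVS service per port; drop the echo to actually run it
# (requires root and the kartotherian VIP still configured).
vip=10.2.1.13
for port in 443 6533; do
  echo ipvsadm --delete-service --tcp-service "${vip}:${port}"
done
```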
[16:47:38] elukey:
[16:47:38] ('CRITICAL: Mismatch between IPVS and PyBal
[16:47:39] ', "Hosts known to PyBal but not to IPVS: set(['maps1010.eqiad.wmnet', 'maps1009.eqiad.wmnet', 'maps1008.eqiad.wmnet', 'maps1007.eqiad.wmnet', 'maps1005.eqiad.wmnet'])")
[16:48:13] let's check what's up
[16:48:26] sukhe: maybe it is just missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128346/2 ?
[16:49:19] ipvsadm doesn't show the old ports anymore
[16:49:37] so I guess that the maps hosts are not associated with anything and pybal is not happy
[16:50:30] elukey: yes let's merge that also
[16:50:56] all right proceeding
[16:52:17] done
[16:53:23] looking
[16:54:05] I see recoveries
[16:54:07] cool, I am restarting pybals again
[16:54:12] yeah, the config changed so
[16:54:16] no worries, leave that to me
[16:54:20] ah snap sorry :(
[16:54:21] what else is left in the cleanup?
[16:54:32] nothing from the LVS side
[16:54:46] cool :)
[16:54:52] thanks a lot! <3
[16:55:04] anytime, you do all the work
[17:54:28] 06Traffic: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322#10653568 (10BCornwall) varnishkafka 1.1.0-5 has been imported into component/varnish-staging and links properly: ` brett@cp4041:~$ readelf -d /usr/bin/varnishkafka | grep libvarnishapi 0x000000000000000...
[18:29:17] 06Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10653916 (10BCornwall)
[19:57:30] 06Traffic: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322#10654245 (10BCornwall) 05Open→03Resolved
[21:04:20] 06Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10654498 (10BCornwall)
[21:04:27] 06Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10654500 (10BCornwall) 05In progress→03Resolved
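The "Hosts known to PyBal but not to IPVS" alert above is essentially a set difference between pybal's configured realservers and what `ipvsadm -L -n` reports. A sketch with made-up, shortened host lists (comm needs sorted input):

```shell
# Fabricated, shortened host lists standing in for pybal's config and
# ipvsadm's output; comm -23 prints lines only in the first file,
# i.e. hosts pybal expects but IPVS no longer carries.
pybal=$(mktemp); ipvs=$(mktemp)
printf 'maps1005\nmaps1007\nmaps1008\n' | sort > "$pybal"
printf 'maps1008\n' | sort > "$ipvs"
comm -23 "$pybal" "$ipvs"
rm -f "$pybal" "$ipvs"
```

In the conversation the fix was merging the follow-up patch that drops the maps hosts from pybal's config, which empties that difference, hence the recoveries.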