[04:45:43] FIRING: [13x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[04:50:43] RESOLVED: [38x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[06:04:45] Traffic, Hiddenparma, Patch-For-Review: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621#11105316 (Joe) Open→Resolved
[06:36:47] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11105337 (ayounsi) Note we need to keep in mind that the main goal here is to move the mgmt routers to use BGP instead of OSPF. It's fine to do some light recabling if it m...
[07:16:01] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11105392 (ayounsi)
[07:18:49] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11105396 (ayounsi) Sounds good ! It would also be fine to route temporarily through the core routers depend...
[09:57:04] stevemunene: are you taking care of restarting pybal on codfw or should I take care?
[09:58:12] I would appreciate some help on that vgutierrez , thanks
[09:58:31] what's the current status? patch has been merged and that's it?
[09:59:10] yes patch merged, was preparing the restart steps
[09:59:30] have you disabled puppet on the impacted LBs?
[09:59:55] not yet
[10:00:32] that needs to happen before merging it or you don't have any guarantees of puppet not catching your change as soon as it's merged
[10:03:02] Ack, on both eqiad and codfw?
[10:07:18] just codfw
[10:08:14] impacted lvs.. lvs[2013-2014].codfw.wmnet
[10:08:36] or `A:lvs-low-traffic-codfw OR A:lvs-secondary-codfw`
[10:10:53] Ack, then run puppet -> verify then restart pybal.service on the same
[10:31:41] yes, you can use sre.loadbalancer.restart-pybal cookbook for that
[10:38:25] Thanks!
[11:06:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled
[11:06:36] it looks like you pooled the servers without being able to serve traffic stevemunene
[11:40:31] Actively working on this, seems like we missed something
[12:17:02] Traffic, Hiddenparma: Add known-client-ingestion-source objects an logic - https://phabricator.wikimedia.org/T402014#11106090 (Vgutierrez)
[12:36:11] Traffic, Data-Engineering: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512 (Vgutierrez) NEW
[12:36:25] Traffic, Data-Engineering: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11106139 (Vgutierrez) p:Triage→Medium
[12:40:45] Traffic, Data-Engineering: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11106154 (Vgutierrez)
[12:59:06] stevemunene: as noted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178834/comments/a89326fb_293b33ed, for future, perhaps it may be a good idea to sync with us for such deploys
[12:59:42] the primary reason is and as noted on https://wikitech.wikimedia.org/wiki/LVS#Please_read_before_we_get_started...
[12:59:53] that it's a good idea to get the patches ready and then merge them in the order listed on the page
[13:00:39] that avoids issues with the deploy and in general since some of these may be tricky
[13:01:13] rather https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[13:01:36] in any case, let's keep this in mind for next time but for now, let us know if we can help
[13:09:17] Well noted and apologies for the confusion and alerts, thanks for the support
[13:11:04] no worries at all. for now though, are you stuck somewhere? can we help?
[13:17:44] Currently stuck on https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Calico_node/controllers The sync is stuck setting up the calico-node controllers
[13:27:34] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11106403 (Papaul) Understood
[13:33:43] FIRING: [11x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[13:38:43] RESOLVED: [38x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[16:59:57] bd808: please let us know if we can help with debugging T402557?
[16:59:58] T402557: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557
[17:00:13] brett: ^ for VCL stuff
[17:01:11] thanks
[17:22:55] Traffic, Data-Engineering: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11107896 (Vgutierrez) I think I've identified the issue, right now haproxy always log `sequence: 0` for `` requests
[17:43:53] Traffic: Use %lc (frontend_log_counter) for sub-sampling in webrequest_sampled_live - https://phabricator.wikimedia.org/T402573 (CDanis) NEW
[17:44:00] Traffic, Data-Engineering: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11107990 (Vgutierrez) Right now we get the sequence number from haproxy `%rt` log format, that's `request_counter (HTTP req or TCP session)` according to its docum...
[17:44:32] Traffic: Use %lc (frontend_log_counter) for sub-sampling in webrequest_sampled_live - https://phabricator.wikimedia.org/T402573#11107991 (CDanis)
[17:44:55] Traffic: Use %lc (frontend_log_counter) for sub-sampling in webrequest_sampled_live - https://phabricator.wikimedia.org/T402573#11107993 (CDanis) →Duplicate dup:T401383
[17:44:58] Traffic, Data-Engineering: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11107995 (CDanis)
[17:56:19] Traffic, Data-Engineering, Patch-For-Review: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11108064 (Vgutierrez) p:Triage→High flagging as high cause this is already making the downsampling in benthos fail (nice catch by @CD...
[18:12:44] Traffic, Movement-Insights, Data-Engineering (Q1 FY25/26 July 1st - September 30th): NEW BUG REPORT: Investigate rise in May 2025 Reader metrics - https://phabricator.wikimedia.org/T395934#11108125 (Mayakp.wiki) Movement Insights is currently testing 1 week of baseline (April) and Issue (May) data; a...
[18:22:50] netops, Infrastructure-Foundations, SRE: Homer: Add Python modules to configure Nokia SR Linux switches - https://phabricator.wikimedia.org/T402577 (cmooney) NEW p:Triage→Medium
[18:22:59] netops, Infrastructure-Foundations, SRE: Homer: Add Python modules to configure Nokia SR Linux switches - https://phabricator.wikimedia.org/T402577#11108196 (cmooney)
[19:30:26] Traffic, envoy, serviceops, SRE: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 (RLazarus) NEW
[19:52:07] netops, Infrastructure-Foundations, SRE: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 (cmooney) NEW p:Triage→Medium
[20:00:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588#11108541 (cmooney)
[20:01:23] netops, Infrastructure-Foundations, SRE: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590 (cmooney) NEW p:Triage→Medium
[20:01:38] netops, Infrastructure-Foundations, SRE: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590#11108572 (cmooney)
[21:21:34] Traffic, DC-Ops, ops-esams, ops-magru, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11108817 (RobH) In progress→Resolved After discussion within both Traffic and DC Ops we're going to resolve this with the fans just running faster.