[01:29:41] Traffic: ncmonitor should ignore invalid duplicate MarkMonitor domains - https://phabricator.wikimedia.org/T393734#11117553 (BCornwall) In progress→Resolved
[01:56:32] Traffic, Commons, DBA, SRE: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117609 (Novem_Linguae) The user replied "calm down" instead of making the requested change. Not a great sign. Agree that maybe a sysadmin should just make t...
[02:04:11] Traffic, Commons, DBA, SRE: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117617 (Zache) >>! In T402749#11117609, @Novem_Linguae wrote: > The user replied "calm down" instead of making the requested change. Not a great sign. Agree...
[03:52:13] Traffic, Commons, DBA, SRE: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117660 (Josve05a) Only an interface sysops will be able to edit the user's specific .js pages (not a mere regular sysops as myself), but unless they act the...
[03:54:26] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11117666 (DavidBrooks) You ask clients to respect HTTP code 429 Too Many Requests. Returning to AutoWikiBrowser: the current code will simply throw a fai...
[09:07:25] FIRING: SystemdUnitCrashLoop: varnish-frontend-fetcherr.service crashloop on cp5020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:12:25] FIRING: [6x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:17:25] FIRING: [7x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:18:41] great.. that's me :)
[09:22:25] FIRING: [12x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:27:25] FIRING: [12x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:28:28] FIRING: SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1109:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:32:25] FIRING: [15x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:37:25] FIRING: [15x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:38:28] RESOLVED: [2x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1107:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:42:25] FIRING: [18x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:47:25] RESOLVED: [16x] SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp1112:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:16:22] Traffic, Hiddenparma: Add known-client-ingestion-source objects an logic - https://phabricator.wikimedia.org/T402014#11118807 (JMeybohm) I had a discussion with @Joe regarding the validation/safeguard topic and we decided to kick it down the road for now. It is a fairly complex topic and pretty hard to g...
[13:36:15] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119143 (Urbanecm_WMF) > Our goal is to block all traffic from unidentified clients and not coming from authorized actors, like toolsforge or our intern...
[13:59:59] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119288 (Vgutierrez) >>! In T400119#11119143, @Urbanecm_WMF wrote: >> Our goal is to block all traffic from unidentified clients and not coming from aut...
[14:22:51] Traffic, cloud-services-team, Cloud-VPS, DNS, SRE: PDNS in cloud can return inconsistent answers - https://phabricator.wikimedia.org/T281700#11119387 (ssingh) Open→Resolved a:ssingh Some quick notes: - We are running `pdns-recursor` 4.8 in production, with an upgrade to 5 in...
[15:23:57] Traffic, Data-Engineering, Patch-For-Review: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11119641 (Vgutierrez) Open→Resolved a:Vgutierrez I'm closing this since we've fixed the wrong behavior on HAProxy regarding s...
[15:24:16] Traffic: varnish-frontend-slowlog service restarts with decoding error - https://phabricator.wikimedia.org/T402634#11119648 (Vgutierrez) Open→Resolved a:Vgutierrez
[15:30:23] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11119711 (Jclark-ctr) a:Jclark-ctr→None
[15:45:09] FIRING: LVSHighCPU: The host lvs1018:9100 has at least its CPU 14 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[15:50:09] RESOLVED: LVSHighCPU: The host lvs1018:9100 has at least its CPU 14 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[15:50:37] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119921 (thcipriani) >>! In T400119#11089795, @Joe wrote: >>>! In T400119#11086977, @bd808 wrote: >>>>! In T400119#11084530, @Samwilson wrote: >>> ~~Wil...
[15:56:05] Traffic, Data-Engineering: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11119965 (Ottomata)
[15:59:09] Traffic, Data-Engineering: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11119987 (Ottomata) Another option that people do, but not one I would recommend: - use the https://wikitech.wikimedia.org/wiki/Data_Platform/Web_publication...
[16:03:46] hello traffic friends - a little while ago, we updated nginx on conf2004 (cluster member used by pybals in codfw). this disrupted existing connections as nginx restarted, and pybals seem to have been recovered fine (e.g., the `PyBal connections to etcd` check converged again quickly after).
[16:03:46] I'm trying to weigh whether a round of pybal restarts makes sense in addition, just out of an abundance of caution. input would be greatly appreciated :)
[16:04:01] if the answer is yes, then I'm happy to drive those via the cookbook
[16:04:34] IIRC there was really no need to do that?
[16:04:48] but yeah, go for it if you want out of an abundance of caution -- we have no ongoing maintenance yet
[16:06:00] so, here's the thing: for the work we _originally_ planned (just the switch to using cfssl-based certs), we were considering doing so simply to force pybal to establish new TLS connections
[16:06:27] what we ended up doing was to also bundle in an nginx upgrade, to consolidate "validation ceremony" into one event :)
[16:07:08] in theory, we "shouldn't need" the pybal restarts anymore in order to verify TLS connections on the new certs (since the nginx restart did that for us)
[16:07:40] the main question is, as always, how well pybal actually handles these kinds of upstream disruptions
[16:08:07] Traffic, envoy, serviceops, SRE, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11120051 (MoritzMuehlenhoff) >>! In T402584#11115783, @RLazarus wrote: >>>! In T402584#11113754, @MoritzMuehlenhoff wrote: >> We also have 237 baremetal host...
[16:09:15] swfrench-wmf: ok. let's keep an eye out and do the restarts if necessary, no timing issues for us at least so no cause for concern
[16:09:18] thanks for checking as always!
[16:09:55] sukhe: thanks! what I might do is go ahead and restart the secondary, just to see whether there's any substantial change in behavior vs. the primaries
[16:10:11] (or rather, difference in behavior around interactions with etcd)
[16:10:15] yep sounds good
[16:10:21] (that's what we do too :)
[16:11:49] * swfrench-wmf is doing
[16:18:39] sukhe: no issues on the secondary after restart, so I think we're good to let things sit as they are. unless you have any concerns, I'll stop here without touching the primaries :)
[16:20:59] swfrench-wmf: sounds good and thanks!
[16:58:45] Traffic, Data-Engineering: Request for a new request dataset for caching research - https://phabricator.wikimedia.org/T401331#11120372 (ssingh) Thanks for creating this task and the request. We will be discussing it shortly within Traffic and Data Engineering and we will follow up.
[17:24:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[17:29:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[17:56:37] Traffic, envoy, serviceops, SRE, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11120660 (Dzahn)
[18:15:43] FIRING: HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5017&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[18:20:15] Traffic: ncmonitor should not submit new CRs if there are still some yet to be reviewed - https://phabricator.wikimedia.org/T368694#11120779 (BCornwall) Open→In progress
[18:20:43] RESOLVED: [2x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[18:22:17] Traffic, SRE, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959 (Lydia_Pintscher) NEW
[18:22:47] Traffic: ncmonitor should check real-world NS records - https://phabricator.wikimedia.org/T402960 (BCornwall) NEW
[18:22:52] Traffic: ncmonitor should check real-world NS records - https://phabricator.wikimedia.org/T402960#11120803 (BCornwall) p:Triage→Medium
[18:24:17] Traffic: ncmonitor should verify that DNSSEC is disabled in MarkMonitor - https://phabricator.wikimedia.org/T402961 (BCornwall) NEW
[18:24:21] Traffic: ncmonitor should verify that DNSSEC is disabled in MarkMonitor - https://phabricator.wikimedia.org/T402961#11120816 (BCornwall) p:Triage→Medium
[18:29:51] Traffic, SRE, Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11120837 (ssingh) >>! In T402284#11101087, @fnegri wrote: > @Dzahn fine with me, but if there's an easy way to keep e.g. a 5-minute cache it could be nice to have. I'll let @Joe...
[18:42:18] Traffic, SRE, Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11120914 (MoritzMuehlenhoff) >>! In T402284#11120837, @ssingh wrote: >>>! In T402284#11101087, @fnegri wrote: >> @Dzahn fine with me, but if there's an easy way to keep e.g. a 5...
[19:23:25] netops, Infrastructure-Foundations, SRE: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11121128 (ayounsi) If I understand correctly we currently get some "per rack" load balancing, where `E3` might randomly prefer `E1` but servers in `E4` m...
[22:57:39] Traffic, envoy, serviceops, SRE: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11121811 (RLazarus) Validated on mathoid and mw-debug (mathoid still on envoy-future, mw-debug back on 1.23 for now). One config warning in the logs from mw-debug: ` [2025-08-26...
[22:58:59] Traffic, Commons, DBA, SRE: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11121812 (Josve05a) For preservation of information the user blanked both their gadget script and their common.js following a request on their user talk page....
[23:29:52] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[23:34:42] Traffic, envoy, serviceops, SRE: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11121920 (RLazarus) More deprecation warnings from the API Gateway (started locally after modifying charts/api-gateway/values-devel.yaml to use envoy-future: ` [source/common/pro...
[23:34:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS