[07:19:30] <_joe_> why does haproxy for uploads have the acl mathoid_as_a_service used in a lot of if clauses?
[07:27:33] <_joe_> gosh the number of small mistakes and inconsistencies in our caching layer
[07:30:30] _joe_: good morning to you too :)
[07:30:50] I think it's related to some work fabfur applied to simplify the haproxy configuration
[07:31:59] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135827
[07:32:27] but the previous version (hiera based) was on hieradata/common/profile/cache/haproxy.yaml
[07:32:39] so it was already impacting both text and upload
[07:33:15] and yes... acls related to mathoid in the upload cluster are just wasting CPU cycles
[07:34:26] <_joe_> vgutierrez: good morning :D
[07:34:57] so right now we don't have an easy way of fixing that without reverting I6116a8a2bdd468702e8f5afaa3a48ea106d8af12
[07:35:08] <_joe_> I'm reviewing the configurations of upload as I'm trying to make diagrams of present/future like I did for text here https://gitlab.wikimedia.org/oblivian/diagrams/-/tree/main/images?ref_type=heads
[07:35:14] so I'll let fabfur think what's the best approach
[07:35:16] <_joe_> so I find small things
[07:35:44] <_joe_> tbh? I would wait before refactoring things more until we're refactoring all request filtering later
[07:36:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:44:41] mmm
[07:44:43] let me check
[07:47:47] I didn't remove the possibility to use hiera, if needed we can move this alone back to hiera
[07:51:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:51:53] fabfur: given the conditionals you need to rollback everything related to concurrency limits as well
[07:52:04] yes
[08:03:32] Traffic: Benchmark differnet options - https://phabricator.wikimedia.org/T393671 (Fabfur) NEW
[08:04:19] Traffic: Map ISPs in Maxmind db, used in turnilo/superset, to use in requestctl rule - https://phabricator.wikimedia.org/T392219#10803351 (Fabfur) Open→In progress
[08:07:55] Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10803354 (Vgutierrez)
[08:37:10] Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10803499 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d93798dd-52a2-4ea7-a8be-b23ed9ee945d) set by fabfur@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Testing in progress ` cp700...
[08:43:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[08:45:27] that's fabfur messing up with haproxy@cp7001 :)
[08:45:41] mmm I downtimed the alerts
[08:45:58] I'll ack this too
[09:00:19] Traffic, Experimentation Lab, Patch-For-Review: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10803562 (Vgutierrez)
[09:57:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[10:07:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[11:52:25] FIRING: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:21] uh?
[11:53:43] why is acmechief running on 2002?
[11:55:40] walking the dogs.. I'll check that ASAP
[11:57:25] RESOLVED: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:31] hmm when we moved from acmechief1002 to acmechief2002?
[12:05:05] 527e9070cfb (Brett Cornwall 2025-04-28 12:38:29 -0700 19) profile::acme_chief::active: acmechief2002.codfw.wmnet
[12:06:12] ok, we don't need to swap active hosts for reboots
[12:06:58] common.yaml:acmechief_host: 'acmechief1002.eqiad.wmnet'
[12:07:15] so now we got acmechief2002 issuing certs but clients using acmechief1002 to fetch them
[12:08:23] brett: for future acme-chief reboots please follow https://wikitech.wikimedia.org/wiki/Service_restarts#acmechief_hosts
[12:10:35] May 08 11:52:08 acmechief2002 acme-chief-backend[1574451]: ACME Directory has rejected the challenge(s) for certificate non-canonical-redirect-4 / ec-prime256v1
[12:10:39] hmm
[12:12:28] "detail": "Incorrect TXT record \"v=spf1 include:_cidrs.wikimedia.org include:_spf.google.com ip4:74.121.51.111 ~all\" (and 6 more) found at _acme-challenge.pywikipedia.org",
[12:12:39] that's definitely weird (and wrong)
[12:13:05] vgutierrez@carrot:~$ host -t ns pywikipedia.org
[12:13:05] pywikipedia.org is an alias for wikimedia.org.
[12:13:11] err....
[12:15:09] FIRING: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:15:17] ok... we lost control of pywikipedia.org
[12:15:32] Name Server: ns061.auroradns.eu
[12:15:32] Name Server: ns062.auroradns.nl
[12:15:32] Name Server: ns063.auroradns.info
[12:15:39] or we never got it, dunno
[12:20:09] RESOLVED: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:21:45] Traffic, DC-Ops, ops-esams, SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10804369 (ssingh)
[14:05:17] Traffic, RESTBase, RESTBase Sunsetting, serviceops, Content-Transform-Team (Work In Progress): Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10804651 (MSantos)
[14:05:29] Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10804652 (MSantos)
[15:05:04] vgutierrez: Sorry about that. I'll follow it
[15:05:27] pywikipedia.org was never set to our dns servers. I've asked on their mailing list to fix it
[15:05:51] so we cannot issue certs against it
[15:07:27] ack
[15:08:09] I'll just remove it then, it's been a few weeks without reply and the domain hasn't worked for a long time
[15:08:19] that's already done
[15:09:28] TBH I'm inclined to close T388809 as invalid
[15:09:29] T388809: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809
[15:09:42] I meant fully - dns et al
[15:09:45] it's just a "random" domain pointed to our CDN
[15:09:50] yeah
[15:10:06] brett: yes, please go ahead with the proper cleanup :D
[15:10:14] aye aye. Sorry for the trouble
[15:10:20] no problem
[15:11:39] https://gerrit.wikimedia.org/r/c/operations/dns/+/1143595
[15:14:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143597
[16:08:08] Traffic, Experimentation Lab, Patch-For-Review: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10805188 (Vgutierrez)
[16:26:44] netops, Infrastructure-Foundations, SRE: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805253 (akosiaris) +1 for what is worth.
[16:46:10] netops, Infrastructure-Foundations, SRE: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805304 (Vgutierrez) We could definitely use that kind of data :)
[16:49:51] Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10805309 (BCornwall)
[16:56:08] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:22:38] FIRING: [3x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3070 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:27:38] FIRING: [7x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3066 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:32:38] RESOLVED: [5x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3066 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:29:59] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10805607 (ssingh)
[18:41:02] anyone from traffic around in 1-2 hours? we'd like to migrate query.wikidata.org to `wdqs-main` (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139531). when we tried it a couple days ago we got 502's but we think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143194 will have solved the issue, but might be good to have someone from traffic around just in case
[18:41:23] brett can help if he is around ^
[18:42:00] process will be very simple, basically merge backend.yaml patch, run puppet on cp text hosts, and watch wdqs graphs
[18:43:40] yeah, I'll be here
[19:02:33] brett: any preferences on time? inflatador and i can do anytime between t+30mins and t+2h. earlier probably better but we can work around your schedule
[19:03:33] I'm free through all that time, so whenever's best for you
[19:05:43] Traffic, Patch-For-Review: varnish 7.1.1-1.1~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T391334#10805716 (BCornwall) Stalled→Resolved a:BCornwall It seems like we've had good fortune and not experienced a crash. I'm going to be bold and close this - if misfortune should hit us a...
[19:06:27] cool let's do in 25 mins
[19:06:55] aye aye
[19:33:44] I guess let's wait for grafana to get back in service ^^
[19:34:45] agreed
[19:51:53] Traffic: ncmonitor should ignore invalid duplicate MarkMonitor domains - https://phabricator.wikimedia.org/T393734 (BCornwall) NEW
[19:52:18] grafana's back, let's give it a few mins though and make sure things settle
[19:53:08] Traffic: ncmonitor should ignore invalid duplicate MarkMonitor domains - https://phabricator.wikimedia.org/T393734#10805825 (BCornwall)
[19:53:37] Traffic: ncmonitor should ignore invalid duplicate MarkMonitor domains - https://phabricator.wikimedia.org/T393734#10805826 (BCornwall) p:Triage→Low
[19:55:12] Traffic: ncmonitor should ignore invalid duplicate MarkMonitor domains - https://phabricator.wikimedia.org/T393734#10805834 (BCornwall)
[20:02:39] brett: inflatador: yall comfortable moving ahead?
[20:02:49] sure, ready when you are
[20:02:53] just about to message. hit it!
[20:03:18] https://meet.google.com/ozu-gdro-zxg
[20:03:41] oh, it's Thursday afternoon. Okta time for me ;)
[20:04:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143659
[20:55:31] Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10805978 (BCornwall)
[21:25:31] Before I waste energy with a poor implementation, could some of you fine folks look at my idea for making it easier to poke holes in the blocked_nets setting we are using in Beta Cluster? https://phabricator.wikimedia.org/T393481#10802783
[21:25:56] This would be changes to wikimedia-frontend.vcl.erb
[21:26:38] feature flagged somehow of course
[22:38:26] Traffic, DC-Ops, ops-esams, SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10806152 (RobH) url provided by support so i've uploaded the support collection report for their review
[22:53:54] Traffic, Beta-Cluster-Infrastructure: Add allowlist to make poking holes in abuse_networks:blocked_nets:networks easier - https://phabricator.wikimedia.org/T393481#10806169 (BCornwall)
[22:59:49] bd808: I'm not the authority on this stuff here but it seems like the reasonable way forward to me.
[23:00:33] From a procedure standpoint, though... who's going to be maintaining this list?
[23:11:05] brett: me and others in the deployment-prep project. We use Horizon to inject hiera settings for the project.
[23:11:25] makes sense
[23:11:59] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep#Blocking is the thing this would help with
[23:13:07] bots crushed things such that I blocked ~12% of the IPv4 space and have been slowly poking holes for real people who ask. The CIDR math is tedious so we would like to add in an allow list basically.
[23:21:50] Yeah, seems reasonable to me. But I can't speak for v.gut or s.uk
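
For context on why the CIDR math mentioned at 23:13 is tedious: carving one allowed range out of a blocked network means replacing that network with the set of smaller networks that cover everything except the hole. The Python sketch below illustrates that arithmetic with the standard `ipaddress` module. It is only an illustration and not the change proposed in T393481, which concerns an allowlist in wikimedia-frontend.vcl.erb; the function name `punch_holes` and the example ranges (RFC 5737 documentation networks) are hypothetical.

```python
import ipaddress


def punch_holes(blocked, allowed):
    """Return the blocked networks with every allowed network carved out.

    Hypothetical helper for illustration only: both arguments are
    iterables of CIDR strings, and the result is a sorted list of
    ipaddress network objects.
    """
    result = [ipaddress.ip_network(b) for b in blocked]
    for entry in allowed:
        hole = ipaddress.ip_network(entry)
        remaining = []
        for net in result:
            if net.version != hole.version or not net.overlaps(hole):
                # Unrelated network: keep it untouched.
                remaining.append(net)
            elif hole.supernet_of(net):
                # The allowed range covers the whole blocked network: drop it.
                continue
            else:
                # The hole is a strict subnet: keep everything around it.
                remaining.extend(net.address_exclude(hole))
        result = remaining
    # Sort on a key so IPv4 and IPv6 networks can coexist in one list.
    return sorted(
        result, key=lambda n: (n.version, int(n.network_address), n.prefixlen)
    )


if __name__ == "__main__":
    for net in punch_holes(["198.51.100.0/24"], ["198.51.100.64/26"]):
        print(net)
```

A single /26 hole in a /24 already turns one entry into two (`198.51.100.0/26` and `198.51.100.128/25`), and the count grows with the difference in prefix lengths, which is the bookkeeping an allow list would avoid doing by hand.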