[07:21:04] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11041918 (elukey) I am wondering if the issue is related to how the current recipe defines the EFI partition: ` d-i partman-auto/expert_recipe string \ multiraid ::...
[08:12:10] Traffic, SRE, SRE-SLO: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675 (fgiunchedi) NEW
[09:24:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:34:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:39:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:44:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:58:03] Traffic, Patch-For-Review: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620#11042270 (Fabfur) Open→Resolved
[09:59:38] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11042274 (elukey) While trying a custom recipe and getting the same d-i error, I checked the d-i shell and this seems problematic: ` ~ # ls /dev/s* /dev/snapshot /de...
[10:29:15] Traffic, Hiddenparma, SRE: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11042337 (Joe) a:Joe
[10:40:11] Hey folks!
I tried to work on the cp2XXX hosts but the last issue that I can see is that the two SSDs are not recognized, so only nvmes are showing up
[10:40:34] No idea why, either I am missing something or there is an extra setup that we are missing
[10:44:33] thanks elukey
[10:48:28] elukey: new hardware problems like it's 1996
[10:49:11] these ones are very different from all the other dells, I really hope that we are not going to get more of this "fun"
[10:49:17] so far it was only Supermicro
[11:07:30] Traffic, HaproxyKafka: HaproxyKafka alert on too many dropped messages - https://phabricator.wikimedia.org/T400684 (Fabfur) NEW
[11:55:49] Traffic, collaboration-services, Gerrit, SRE: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#11042563 (ABran-WMF)
[12:15:02] Traffic, SRE, SRE-swift-storage: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671#11042628 (MatthewVernon) Open→Declined [I think we're beyond the point of useful investigation of this particular incident, and generally uploading does work]
[12:52:33] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#11042752 (Jclark-ctr) a:cmooney→Jclark-ctr
[13:45:12] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043014 (Ottomata) p:Triage→High FYI this is causing data-engineering-alerts email spam every hour.
[14:12:31] Hello, we're getting a lot of HaproxyKafkaDeliveryErrors `AlertLintProblem data-engineering (/srv/alerts/ops/team-data-engineering_haproxykafka.yaml localhost:9123 pint alerting HaproxyKafkaDeliveryErrors prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "haproxykafka_saturation_errors" metric in the last 1w ops promql/series warning prometheus` could this be related to T391810?
[14:12:33] T391810: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810
[14:12:49] cc fabfur -^
[14:17:21] yes, it's a known issue, unfortunately I don't know how to fix this other than disabling the alert for HaproxyKafkaDeliveryErrors
[14:17:44] as it's based on a metric that may or may not be there (the haproxykafka_saturation_errors)
[14:24:16] stevemunene: that's an o11y bug, not ours
[14:25:55] you can use https://gerrit.wikimedia.org/r/c/operations/alerts/+/1164957 as a guide on how to disable the linter for the impacted queries
[14:26:52] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043291 (Ottomata) FYI, the Refine Yarn application id that evolved the event.development_network_probe table is , and the ALTER that was run is: `application_1750705250302_912703` `lan...
[14:30:58] Traffic: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11043336 (Vgutierrez) @Tgr thanks, what's the source of truth for JWT keys at the moment? are the public keys being exposed somehow? (assuming some asymmetric encryption algorithm is being used)
[15:14:37] Thanks fabfur , vgutierrez
[15:28:51] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043731 (Ottomata) Also for reference, here is a link to the airflow mapped task (116) that evolved this table. https://airflow.wikimedia.org/task?dag_id=refine_to_hive_hourly&task_id=r...
[15:36:31] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043769 (Ottomata) I spent a little time today trying to figure out the right thing to do. If @cdanis etc. is okay with missing some historical data, I think we should: - drop the `eve...
[15:37:27] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043771 (CDanis) >>! In T400360#11043769, @Ottomata wrote: > I believe this would preserve the past 60(?) days of data, as well as keep the newly added `host` field in 1.1.0 of the schem...
[15:42:26] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043815 (mforns) @Ottomata Sounds good! If the record is formatted correctly we can assume the re-recreation of the table from scratch will work fine.
[15:54:40] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11043862 (Ottomata) Testing. Ran ` spark3-submit --driver-cores 1 --master 'local[1]' --conf spark.yarn.maxAppAttempts=1 --conf write.spark.accept-any-schema=true --conf spark.hadoo...
[16:03:26] Traffic, envoy, serviceops, SRE: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11043915 (hnowlan) Ideally we'd need to go to 1.33 or later with this work - I have not scoped how many more complications this work will entail but for WE5.1.3 and future rate limiting work we'l...
[16:35:43] Traffic: Migrate MarkMonitor redirection services over to ncredir - https://phabricator.wikimedia.org/T400731 (BCornwall) NEW
[16:36:49] Traffic: Migrate MarkMonitor redirection services over to ncredir - https://phabricator.wikimedia.org/T400731#11044045 (BCornwall) p:Triage→Low
[16:48:36] Traffic: Migrate MarkMonitor redirection services over to ncredir - https://phabricator.wikimedia.org/T400731#11044075 (BCornwall)
[17:14:09] Traffic: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11044178 (Vgutierrez) @Tgr we're seeing requests at least targeting `/w/api.php` shipping a JWT token on a `Authorization` header missing the `Bearer` schema.. so something like `Authorization: $TOKEN` rather than `A...
[17:43:04] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11044305 (Jhancock.wm) There's two things i can think of. One is converting the disks to raid capable and either setting two RAID0 or one RAID1. Setting a "First Dev...
[18:34:12] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11044504 (Ottomata) @mforns and I worked on this today, and it turned out that backfilling this data would be quite difficult. RefineToHiveDataset (the new Refine on Airflow CLI) only wo...
[18:39:06] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11044508 (Ottomata) So status: - old data is in `event`.`development_network_probe_T400360_original`; Data in this table since the schema change was merged (2025-07-23T11:00:00 UTC ?) i...
[18:45:25] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11044553 (Ottomata)
[18:47:28] Traffic: New software: ProxyTester - https://phabricator.wikimedia.org/T400244#11044566 (CDanis) Just a note that the existing config validity tests in puppet `modules/profile/files/cache/haproxy/tests` have been broken by `bullseye-backports` no longer existing on mirrors.
[19:32:32] Traffic, envoy, serviceops, SRE: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11044850 (RLazarus)
[19:45:15] Traffic, envoy, serviceops, SRE: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11044872 (RLazarus) a:RLazarus >>! In T380211#11043915, @hnowlan wrote: > Ideally we'd need to go to 1.33 or later with this work Ack. We're running 1.23 and the current release is 1.35, so...
[21:00:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159#11045143 (wiki_willy) a:Jclark-ctr
[21:00:36] Traffic: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11045148 (Tgr) >>! In T400238#11043336, @Vgutierrez wrote: > what's the source of truth for JWT keys at the moment? There are two, private MediaWiki config and private puppet. See T392647#10802093 for details. > ar...
[21:00:58] Traffic, MW-1.45-notes (1.45.0-wmf.12; 2025-07-29), Patch-For-Review: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#11045151 (CDanis) {F65689847} Works! Backported to `wmf12`, so it's currently live on group0, and will roll out...
[21:02:11] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11045152 (wiki_willy) a:VRiley-WMF
[21:14:32] Traffic, Data-Engineering: Fix Hive event.development_network_probe table - https://phabricator.wikimedia.org/T400360#11045172 (CDanis) >>! In T400360#11044508, @Ottomata wrote: > I'll update the task description, and maybe set a scheduled reminder message in slack? :) That sounds good, thanks Andrew....
[21:16:47] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#11045175 (wiki_willy) Open→Resolved a:RobH Resolving task, we will be installing two new Fundraising cabinets as a s...
[22:18:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.30:9200 @ cirrussearch2091 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[22:23:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.30:9200 @ cirrussearch2091 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS