[00:01:34] Traffic, MediaWiki-Platform-Team, WikimediaDebug: X-Wikimedia-Debug cookie not routed correctly in Kubernetes on POST requests - https://phabricator.wikimedia.org/T397439#11052608 (Tgr) Open→Resolved Thanks!
[00:05:25] FIRING: SystemdUnitFailed: clean-stale-certs.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:16] Traffic, SRE: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11052623 (Nemoralis)
[00:57:13] Traffic, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052679 (AntiCompositeNumber) Yup, it's working now. Thanks!
[01:02:37] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE, Patch-For-Review: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052701 (Bawolff) [Anyways, I adjusted the QuickInstantCo...
[01:15:25] RESOLVED: SystemdUnitFailed: clean-stale-certs.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:49] Acme-chief, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, Patch-For-Review: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08 - https://phabricator.wikimedia.org/T399419#11052715 (BCornwall) Open→In progress p:...
[02:13:30] Traffic, DC-Ops, ops-esams, ops-magru, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11052738 (BCornwall) a:BCornwall→RobH Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?
[04:06:18] Traffic, SRE, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052789 (Joe) >>! In T400119#11051059, @Alien333 wrote: > Where does UAs like `MediaWiki-JS/1.45.0-wmf.12`, the defaults used by a plain `new mw.Api()` in an on-wiki script,...
[04:10:34] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052791 (Joe) >>! In T400881#11050371, @Bawolff wrote: > Are you suggesting inc...
[04:12:27] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052794 (Joe) >>! In T400881#11052701, @Bawolff wrote: > [Anyways, I adjusted t...
[05:00:10] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052795 (A_smart_kitten) >>! In T400881#11052791, @Joe wrote: >>>! In T400881#1...
[06:00:16] Traffic, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052828 (Joe) Please note, this solution is temporary: bots working from clouds will break repeatedly if they're not properly identified with...
[07:30:07] Traffic, SRE, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052931 (TheDJ) >>! In T400119#11052789, @Joe wrote: > Case in point, I can't find any request with that UA in the logs for the past few days. Indeed it's not in the list of...
[08:04:16] Traffic, HaproxyKafka, Patch-For-Review: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199#11053002 (Fabfur)
[08:11:00] Traffic, HaproxyKafka, Patch-For-Review: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199#11053015 (Fabfur) Open→Resolved This has been addressed with the following changes: - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173950 (shrink socket batch s...
[08:11:47] Traffic, HaproxyKafka, Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review: Haproxykafka silently stops sending request data to kafka - https://phabricator.wikimedia.org/T400039#11053018 (Fabfur) Open→Resolved Closing as per T400199 and opening a new ticket dedic...
[08:16:51] Traffic, HaproxyKafka: Add watchdog support for HaproxyKafka - https://phabricator.wikimedia.org/T400975 (Fabfur) NEW
[08:22:38] Traffic, HaproxyKafka: HaproxyKafka: log processed messages when circuit breaker kicks in - https://phabricator.wikimedia.org/T400976 (Fabfur) NEW
[08:26:25] Traffic, HaproxyKafka: haproxykafka: expose TLS errors as metrics - https://phabricator.wikimedia.org/T400977 (Fabfur) NEW
[08:27:36] Traffic, HaproxyKafka: HaproxyKafka: expose librdkafka metrics - https://phabricator.wikimedia.org/T400978 (Fabfur) NEW
[09:12:58] Traffic, SRE, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053153 (Joe) >>! In T400119#11052931, @TheDJ wrote: >>>! In T400119#11052789, @Joe wrote: >> Case in point, I can't find any request with that UA in the logs for the past fe...
[09:15:08] Traffic, SRE, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053166 (Joe) To give a bit of context, over the last day we saw: * 62 million valid requests with no user-agent * 24.5 million valid requests with user agent `okhttp/*` * 1...
[09:16:20] Traffic, SRE, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053167 (Alien333) Ok, thanks for the precisions!
[09:57:27] hey folks!
[09:57:34] I have good news and bad news
[09:57:59] the good news is that I managed to get d-i to see the two OS SSDs for the new cp2xxx dell nodes
[09:58:14] the bad news is that I had to use the Bookworm installer
[09:58:26] I suspect that the Bullseye one is not compatible with the RAID controller
[09:59:12] (probably kernel drivers?)
[10:00:29] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11053246 (elukey) I finally found a way to make the Debian Installer to see the two OS disks, namely using Bookworm: ` ~ # ls /dev/sd* /dev/sda /dev/sda1 /dev/sdb...
[10:01:30] thanks elukey looking at it shortly
[10:09:33] no rush, I think that we'll need to figure out if Bookworm is a viable option for those hosts now or not
[10:13:19] atm it isn't unfortunately
[10:15:02] lovely
[10:20:16] that's right.. bookworm is a no go
[10:53:59] Traffic: Migrate prometheus-rdkafka-exporter to Gitlab - https://phabricator.wikimedia.org/T400985 (Fabfur) NEW
[11:03:35] fabfur: at that rate of opening phab tasks you'll get nicknamed the bureaucrat
[11:04:08] I go on PTO next week, I want to make sure you will have enough fun without me
[11:04:55] Traffic: Bump prometheus-rdkafka-exporter go version - https://phabricator.wikimedia.org/T400986 (Fabfur) NEW
[11:06:24] do we really need tasks for that? :)
[11:06:41] given my memory, yes :D
[11:06:54] the problem is that you also forget about open tasks :)
[11:08:09] mmm looks like we don't use the package for prometheus-rdkafka-exporter anywhere
[11:08:16] we just use it as a library
[11:21:29] we use it to build golang binaries debian style
[11:21:32] that's why it's packaged
[11:46:41] Traffic: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11053541 (Vgutierrez) @Tgr where I can find the public key of the RSA pair used to sign tokens for the rest API?
[12:09:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159#11053589 (Jclark-ctr) Open→Resolved
[13:04:57] Traffic, collaboration-services, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053743 (Jelto)
[13:05:43] Traffic, collaboration-services, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053744 (Jelto)
[13:19:00] Traffic, Patch-For-Review: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11053764 (Vgutierrez) We are currently successfully validating JWT tokens for api.wm.o, action API and rest API: `name=api.wm.o - ReqHeader host: api.wikimedia.org - ReqHeader x-pr...
[14:09:42] vgutierrez: re: bookworm no go, is there a timeline for it or just a lot of big blockers? I am not sure what alternatives we have for the new cp2xxx nodes, I fear debian install needs a more up-to-date kernel for that raid controller
[14:10:22] openssl 3.0 is the main blocker
[14:10:24] and it's a big one
[14:10:27] elukey: the blockers are essentially OpenSSL
[14:10:47] yeah. and while we are working on that, we really don't have a timeline yet.
[14:11:27] okok, maybe Moritz will have some creative ideas about this when he's back :D
[14:14:06] in the past, we have used a backported kernel for issues like these (IIRC it was 5.10 on buster)
[14:14:52] yeah, this was: modules/profile/manifests/base/linux510.pp
[14:14:59] -# Some use cases:
[14:14:59] -# - Hosts with GPU (AMD ROCm drivers are published to the kernel, the more recent the better).
[14:15:02] -# This includes Machine Learning and Analytics.
[14:15:04] -# - cloudgw specific NAT settings used by the Cloud team.
[14:15:07] -# - bnx2x NICs firmware issues (cloudnet servers, see T271058)
[14:15:09] T271058: cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058
[14:23:29] sukhe: the tricky bit is also to update the debian install to use that, I have no idea if/how to do it but I guess once the kernel is available it shouldn't be an issue
[14:27:40] elukey: true true, I don't know/remember how we were handling that bit but perhaps that was not during the install but after the system was up
[14:47:49] Traffic, collaboration-services, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054064 (Novem_Linguae) Open→Resolved a:Novem_Linguae Marking as resolved. Thanks!
[15:15:23] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Wikimedia-Site-requests, Patch-For-Review: ESI test string is still shipped by CentralNotice - https://phabricator.wikimedia.org/T400472#11054129 (R4356th) Hi @AKanji-WMF, just to be clear, this is not blocked on anything on SR...
[15:17:46] Traffic, collaboration-services, Phabricator, SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054134 (Pppery) a:Novem_Linguae→None
[15:37:11] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Wikimedia-Site-requests, Patch-For-Review: ESI test string is still shipped by CentralNotice - https://phabricator.wikimedia.org/T400472#11054188 (AKanji-WMF) Thanks @R4356th - I'll add this back to our team's triage.
[16:11:58] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054317 (Joe) >>! In T400881#11052795, @A_smart_kitten wrote: >>>! In T400881#1...
[16:12:59] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054327 (Joe)
[16:39:40] netops, Infrastructure-Foundations: Nokia SR-Linux ZTP - https://phabricator.wikimedia.org/T401013 (ayounsi) NEW p:Triage→Low
[18:15:20] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11054588 (DavidBrooks) >>! In T400119#11053166, @Joe wrote: > There won't be adding some magical regexes trying to ban any single case. We will make the...
[18:55:27] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11054728 (BCornwall) a:BCornwall→None
[19:04:07] Domains, Traffic, Moderator-Tools-Team, The-Wikipedia-Library: Import wikipedialibrary.org into wmf redirection services - https://phabricator.wikimedia.org/T400367#11054744 (BCornwall) wikipedialibrary.org works with both HTTP and HTTPS now: `bash [~]$ curl -s https://www.wikipedialibrary.org -...
[19:04:37] Domains, Traffic, Moderator-Tools-Team, The-Wikipedia-Library: Import wikipedialibrary.org into wmf redirection services - https://phabricator.wikimedia.org/T400367#11054745 (BCornwall) In progress→Resolved We'll follow up discussion on general process improvements for this in ye old...
[19:55:00] Traffic, DNS, SRE: Verify wikimediafoundation.org for Visual Studio Marketplace. - https://phabricator.wikimedia.org/T400089#11054828 (BCornwall) In progress→Resolved Setting as resolved. Please re-open if it hasn't worked or if anything else is needed. Thanks!
[20:02:14] Traffic, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, SRE: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054847 (Bawolff) InstantCommons does work in a way that is pretty easy to abus...
[20:53:00] Traffic: Investigate setting init_on_alloc=0 on cache hosts - https://phabricator.wikimedia.org/T401025 (BCornwall) NEW
[22:46:04] Traffic: Migrate MarkMonitor redirection services over to ncredir - https://phabricator.wikimedia.org/T400731#11055059 (BCornwall) I'm seeing a handful of .ee domains that I'm unable to update NS servers without paying $40 (each!): > Updates to this unautomated domain's nameservers will incur a $ 40.00 char...