[01:49:21] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (TomerLerner) Thank you @akosiaris We can only run client requests in the production URL, I guess it'll do for now until we...
[09:31:55] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (MatthewVernon) Yeah, this is my concern, too - we used to spawn extra requests to copy new thumbnails to the other DC and that ca...
[09:42:49] <_joe_> vgutierrez: re T342577
[09:42:50] T342577: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577
[09:43:03] <_joe_> looking at cluster_fe_ratelimit in text-frontend
[09:43:04] _joe_: maybe -analytics is a better channel for that
[09:43:09] <_joe_> right
[09:43:27] milimetric is there and not here :)
[09:43:27] <_joe_> well not really, this is a vcl thing
[09:43:32] <_joe_> ack
[09:59:15] Traffic, Performance-Team, SRE, SRE-swift-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Ladsgroup) It might sound a bit stupid: Why not just gradually, slowly, roll delete all thumbnails, if it's needed, it'll be rege...
[10:31:42] (SystemdUnitFailed) firing: haproxy.service Failed on lvs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:42] (SystemdUnitFailed) resolved: haproxy.service Failed on lvs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:30] duh.. lvs1014 is a test host :)
[10:49:32] Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (Vgutierrez)
[10:49:54] Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (Vgutierrez) p:Triage→Medium
[10:51:11] Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0241bb95-cb16-4e0c-9671-afb27fba6736) set by vgutierrez@cumin1001 for 31 days, 0:00:00 on 3 host(s) and their services with reason: test hosts ` lvs[1013...
[10:54:42] vgutierrez: I'd like to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/937061 given that we're setting no-cache in the places we want to - given that this is a pretty standard change I don't need to stop puppet or anything right?
[10:54:50] * vgutierrez looking
[10:55:10] hnowlan: you're right, please go ahead
[10:55:17] thanks!
[11:35:29] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[11:43:56] Traffic, Infrastructure-Foundations: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (Reedy)
[11:44:06] Traffic, Infrastructure-Foundations: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (Reedy)
[11:46:34] Traffic, Infrastructure-Foundations, SRE: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (Reedy)
[14:03:33] vgutierrez: would you have time to give https://gerrit.wikimedia.org/r/941405 one more spin with what's left of today?
[14:04:54] hnowlan: this is the one that we had to rollback due to api-gateway.discovery.wmnet caching being set to pass, right?
[14:06:46] yeah
[14:14:20] hnowlan: feel free to proceed
[14:28:00] vgutierrez: great, doing it now
[14:32:50] puppet stopped on A:cp, cp2037 depooled, running puppet on cp2037 now
[14:34:21] nice
[14:35:24] puppet is done
[14:38:35] hnowlan: caching works as expected now
[14:38:59] vgutierrez: great!
[14:39:17] are you seeing that in logstash? I was trying to find equivalent log messages to the ones you posted previously
[14:39:28] logstash?
[14:39:36] for a 200 response?
[14:39:48] we don't do that here AFAIK ;P
[14:40:14] haha of course
[14:40:15] I had vgutierrez@cp2037:~$ date && curl -H "Host: wikimedia.org" -H "X-Forwarded-Proto: https" 127.0.0.1:3128/api/rest_v1/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20230506/20230704?vgutierrez=1 -v -o /dev/null -s 2>&1 && date
[14:40:15] handy
[14:40:42] I was thinking of the "ReqURL:http://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20230506/20230704?vgutierrez=17406 ReqHeader:User-Agent:curl/7.74.0 ReqHeader:Host:wikimedia.org ReqHeader:X-Client-IP:- ReqHeader:Cookie:- BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=14400, max-age=14400 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp2039
[14:40:48] miss RespHeader:Backend-Timing:-"
[14:40:58] hnowlan: that's atslog-backend output
[14:41:07] Traffic, SRE: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (ssingh)
[14:41:17] it's used as input for some mtail programs to get ATS metrics
[14:41:23] but we don't send it to logstash
[14:41:33] o11y friends wouldn't be happy with the amount of data
[14:41:34] ah
[14:41:59] AFAIK only 5xx responses from varnish get logged
[14:43:04] Traffic, SRE: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (ssingh) Open→Resolved ` ||/ Name Version Architecture Description +++-==============-===============-============-================================= ii pdns-recursor 4.8.4-1+wmf11u1 amd64...
[14:46:48] vgutierrez: I am pretty happy with the current status if you are- how do you feel about enabling puppet?
[14:47:17] hnowlan: it looks like a good day to earn a t-shirt
[14:47:20] go for it
[14:47:28] haha 🫡
[14:48:24] hnowlan: you will pick up our recent varnish-hospital change as well so don't be surprised :)
[14:48:34] didn't realize puppet was disabled and should have waited
[14:48:42] (don't get fooled by that, you only get stickers these days)
[14:49:07] taavi: there have been bootlegs made :)
[14:49:11] sukhe: ack!
[14:52:15] "these days".. 5 years here and no damn t-shirt
[14:56:53] godog: do you have any experience with eBPF exporters for prometheus? stuff like https://github.com/cloudflare/ebpf_exporter
[14:58:12] vgutierrez: I do not in the sense that I've never used it before, seems quite useful though
[14:58:34] godog: we are planning to melt some old lvs box running katran
[14:59:13] problem with katran is that it's a XDP based LB, packets are handled with eBPF code and never hit the kernel TCP/IP stack
[14:59:34] as a side effect, some tools like tcpdump are useless on its own
[14:59:54] and I'm suspecting that some regular metrics will be lost as well
[15:00:11] easy to believe yeah
[15:02:30] vgutierrez: unique-devices is looking good, thanks so much for the help and patience <3
[15:02:44] hnowlan: nice :D
[15:05:01] vgutierrez: I glanced at the docs and the only thing that stands out to me deployment-wise is the need to ship the compiled programs, docs say that >= linux 5.15 things should work though 🤷
[15:07:44] the BPF bytecode?
[15:07:51] yeah
[15:08:04] otherwise you need clang everywhere
[15:08:18] AFAIK gcc backend for BPF isn't there (yet)
[15:09:44] ah, TIL re: gcc and bpf
[15:10:09] but yeah best case we're able to craft a debian package that just works everywhere
[15:10:37] https://gcc.gnu.org/wiki/BPFBackEnd --> maybe my source was kinda deprecated
[15:11:55] gcc-bpf/stable 12.2.0-14+4 amd64
[15:12:00] bookworm ships it for sure
[15:13:40] neat, yeah bpf exporter mentions clang but not gcc so we'll see
[15:14:08] katran expects to be compiled with clang
[15:14:21] so I'm using clang 14 in bookworm
[15:15:35] nice, do you know if native prometheus metrics are exposed in any way ?
[15:28:57] nope AFAIK
[15:29:10] I do have some for our healthchecks
[15:29:50] and I'm expecting to use those to measure the performance while we hammer katran
[15:30:55] so I'll get back to you on what would be the best way of scraping those metrics considering that will be exposed by a experimental daemon that might work or not :)
[15:31:44] for sure! happy to help
[15:35:29] <_joe_> I need help from a vcl expert
[15:35:49] <_joe_> in text-frontend.vcl.erb, routine cluster_fe_ratelimit
[15:35:49] damn, ema isn't here
[15:36:20] <_joe_> we first have if (req.http.X-Public-Cloud && std.ip(req.http.X-Client-IP, "192.0.2.1") !~ wikimedia_nets && vsthrottle.is_denied("public_cloud_all:" + req.http.X-Client-IP, 1000, 10s))
[15:36:22] I'm no expert, just your designated VCL plumber, what do you need? :)
[15:36:45] <_joe_> then we have, for all non-restbase urls if the request has no session token
[15:36:55] <_joe_> if (req.http.X-Public-Cloud && vsthrottle.is_denied("public_cloud_uncached:" + req.http.X-Client-IP, 100, 10s)) {
[15:37:28] <_joe_> so the difference is just the session token?
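The two throttles quoted from cluster_fe_ratelimit can be compared side by side with a small Python model. This is only a sketch under stated assumptions: vsthrottle is approximated here as a plain token bucket, the `Throttle` class and `is_throttled` helper are hypothetical names, and the `is_wikimedia_ip` and `is_restbase` flags stand in for the `wikimedia_nets` ACL match and the non-restbase URL check, neither of which is shown in full in the log. The cookie regex is the one quoted a few messages later in the conversation.

```python
import re
from dataclasses import dataclass, field

# Cookie regex as quoted in the channel; everything else is illustrative.
SESSION_COOKIE = re.compile(r"([sS]ession|Token)=")


@dataclass
class Throttle:
    """Token-bucket stand-in (an assumption) for vsthrottle.is_denied()."""
    buckets: dict = field(default_factory=dict)

    def is_denied(self, key, limit, period, now):
        tokens, last = self.buckets.get(key, (float(limit), now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(float(limit), tokens + (now - last) * limit / period)
        if tokens < 1.0:
            self.buckets[key] = (tokens, now)
            return True
        self.buckets[key] = (tokens - 1.0, now)
        return False


def is_throttled(throttle, headers, now, is_wikimedia_ip=False, is_restbase=False):
    """Return True if either public-cloud rule would deny this request."""
    if not headers.get("X-Public-Cloud"):
        return False
    ip = headers.get("X-Client-IP", "192.0.2.1")
    # Rule 1: all public-cloud traffic from outside wikimedia_nets, 1000 per 10s.
    if not is_wikimedia_ip and throttle.is_denied("public_cloud_all:" + ip, 1000, 10.0, now):
        return True
    # Rule 2: non-restbase requests without a session/token cookie, 100 per 10s.
    has_session = SESSION_COOKIE.search(headers.get("Cookie", "")) is not None
    if not is_restbase and not has_session:
        return throttle.is_denied("public_cloud_uncached:" + ip, 100, 10.0, now)
    return False
```

In this model a cookieless public-cloud client is cut off after 100 requests inside the window, while a client carrying a session cookie only hits the 1000-request limit, which is the difference being asked about.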
[15:37:45] <_joe_> and I would guess this latter one kicked in last night
[15:37:57] <_joe_> given it's more restrictive than the rule sukhe set up
[15:38:54] _joe_: assuming that the cookie was there
[15:39:23] <_joe_> we're assuming it wasn't
[15:39:42] <_joe_> the latter only kicks in if (req.http.Cookie !~ "([sS]ession|Token)="
[15:40:55] _joe_: hmm right
[15:51:21] _joe_: I need to be missing something obvious cause the VCL code looks buggy to me
[15:51:43] <_joe_> vgutierrez: or maybe not!
[15:51:47] <_joe_> what's buggy?
[15:52:43] <_joe_> vgutierrez: btw I concocted https://gerrit.wikimedia.org/r/c/operations/puppet/+/941448
[15:55:00] _joe_: I don't want to volint you but s/-/_/g
[15:55:13] almost every requestctl rules use _ rather than -
[15:55:15] <_joe_> vgutierrez: be bold and edit the patch
[15:58:23] <_joe_> I have a meeting now
[16:12:47] _joe_: so public_cloud_uncached should be called public_cloud_unauth, cause both public_cloud_all and public_cloud_uncached are throttling uncached requests
[16:13:04] <_joe_> vgutierrez: yes I agree
[16:13:18] <_joe_> the label is wrong but it can be changed in a followup
[16:13:22] yup
[16:28:01] Traffic, SRE: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (Vgutierrez)
[16:32:42] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on dns6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:42] (SystemdUnitFailed) resolved: anycast-healthchecker.service Failed on dns6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:52] Traffic, SRE: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (Vgutierrez)
[16:50:42] (SystemdUnitFailed) firing: prometheus_gdnsd_stats.service Failed on dns6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:54] wait wait
[16:52:00] gdnsd really
[16:52:28] oh phew, prometheus
[16:55:42] (SystemdUnitFailed) resolved: (2) anycast-healthchecker.service Failed on dns6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:58:42] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on dns4004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:03:42] (SystemdUnitFailed) resolved: anycast-healthchecker.service Failed on dns4004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:13:42] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on dns4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:42] (SystemdUnitFailed) resolved: (2) anycast-healthchecker.service Failed on dns4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:03] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)