[02:54:42] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775088 (Kirilloparma) >>! In T374230#10771849, @Silvan_WMDE wrote: > @Kirillopa...
[03:19:18] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775097 (Jakob_WMDE) >>! In T374230#10775088, @Kirilloparma wrote: > > @Silvan_WMD...
[08:28:28] Traffic: Move host normalization to haproxy - https://phabricator.wikimedia.org/T392880 (Fabfur) NEW
[08:43:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:11] need to check why this alert doesn't get silenced
[08:52:55] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:46] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10775640 (Fabfur) Leaving this open as memo to remove Varnish configuration at a later moment
[09:12:59] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10775644 (Fabfur) Open→In progress p:Medium→Low
[09:17:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:39] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10775654 (Vgutierrez) In progress→Stalled
[09:22:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:05] 👋 We are having an issue when purging edge cache for PCS endpoint URLs that contain ":". I've added steps to reproduce the issue here: https://phabricator.wikimedia.org/T392849
[09:26:27] Any idea what is wrong with the way we send purge events?
[09:32:20] Traffic, Content-Transform-Team, serviceops: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10775677 (Jgiannelos) FYI this is not reproduced on endpoints not migrated to rest-gateway yet. eg: * Given this page * https://en.wikipedia...
[09:36:59] nemo-yiannis: last I've heard from _joe_ is that mwscript-k8s can't be used to purge requests yet
[09:38:18] effie: ^^ do you know if that bug has been fixed?
[09:40:43] I don't think that our main issue is mwscript-but rather the purge events we send to kafka from PCS via eventgate
[09:41:37] *mwscript-k8s but
[09:43:59] nemo-yiannis: have you tried using %3A instead of :?
[09:45:12] I am afraid I do not know
[09:45:24] effie: who could be aware of that?
[09:45:29] Yeah, we escape the title part
[09:45:42] There are some requests, responses and events from kafka for details on the ticket
[09:46:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs5005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:34] vgutierrez: rz.l
[09:47:01] let me take a look see if I can help
[09:47:30] Traffic, Infrastructure-Foundations, Spicerack, SRE-tools: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775722 (elukey) We discussed the options on IRC, to summarize: 1) The DNS cookbook co...
[09:48:07] hmmm `FYI this is not reproduced on endpoints not migrated to rest-gateway yet. eg:`
[09:49:53] first thing ^ that made me think is an issue with normalisation
[09:50:15] but the events that hit kafka have the same correct URLs for migrated and unmigrated wikis
[09:50:39] and purges work for other URLs for migrated wikis
[09:51:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs5005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:42] Traffic, Infrastructure-Foundations, Spicerack, SRE-tools: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775738 (elukey) From https://icinga.com/docs/icinga-2/latest/doc/24-appendix/ it seems...
[09:51:52] migration happens via gateway-check.lua right?
[09:52:25] yeah
[09:52:52] so same kind of normalization happens for migrated and not migrated wikis
[09:52:56] at least at the CDN
[09:53:21] https://www.irccloud.com/pastebin/JGuJ96AH/
[09:53:32] that's the relevant part of backend.yaml
[09:54:37] nemo-yiannis: from normalize-path.lua comments...
[09:54:47] -- path = "/wiki/User:Ema%2fProfiling_Python%28Now you know[dude]"
[09:54:47] -- return "/wiki/User:Ema/Profiling_Python(Now you know%5Bdude%5D"
[09:55:07] jgiannelos
[09:55:15] so it looks like `:` shouldn't be encoded in your PURGE requests
[09:57:02] so `:` gets decoded but not encoded
[09:57:07] (at the CDN)
[09:57:17] rest-gateway is aware of that?
[09:57:36] Traffic, Content-Transform-Team, serviceops: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10775748 (hnowlan) From kafka - a successful enwiki purge and a failing testwiki purge: ` { "$schema": "/resource_change/1.0.0", "meta":...
[09:58:40] by default the gateway doesn't touch normalisation either way
[09:59:01] it appears that normalisation isn't affecting the purges themselves, see https://phabricator.wikimedia.org/T392849#10775748
[09:59:11] but it clearly is if it's only happening for the gateway
[09:59:31] ack... let's see if I can debug this :D
[10:00:41] purges are working for other migrated pages fwiw
[10:01:00] other as in not containing `:^ in the URL?
[10:01:03] yeah
[10:01:03] `:`
[10:01:05] ack
[10:05:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:23] FWIW I haven't tested other special characters but this came up while testing `User:` namespace pages
[10:17:36] so..
a quick check filtering by ReqUrl = /api/rest_v1/page/mobile-html/User%3AJGiannelos_%28WMF%29%2Ftest-pcs-rollout
[10:17:50] cp6015 didn't receive a PURGE request after I edited that page
[10:18:37] let's widen the net... and filter urls that contain JGiannelos
[10:19:55] ok... now I see the PURGEs
[10:20:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:33] Traffic, Content-Transform-Team, serviceops: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10775807 (Vgutierrez) from varnish point of view, after editing https://test.wikipedia.org/wiki/User:JGiannelos_(WMF)/test-pcs-rollout the fol...
[10:25:10] so http://test.wikipedia.org/api/rest_v1/page/mobile-html/User%3AJGiannelos_(WMF)%2Ftest-pcs-rollout gets a PURGE request
[10:25:26] note that `(` and `)` aren't encoded
[10:27:47] huh
[10:27:51] Traffic, Content-Transform-Team, serviceops: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10775829 (Vgutierrez) a quick check shows that the URL receiving the PURGE is purged as expected: ` vgutierrez@carrot:~$ curl -4 'https://test...
[10:28:31] see https://phabricator.wikimedia.org/T392849#10775829
[10:30:38] Let me try again without encoding the parentheses
[10:36:41] vgutierrez: I do get `x-cache-status: miss` on your steps on the ticket but the content is not updated
[10:37:26] that's definitely another issue...
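Editorial sketch (not part of the conversation): the normalize-path.lua comment quoted above implies that some percent-encoded characters get decoded while some raw characters get encoded. The Python below is a minimal illustration of that idea only. The names `DECODE_WHEN_ENCODED`, `ENCODE_WHEN_RAW` and `normalize_path` are hypothetical, and the two character sets are assumptions chosen just to reproduce the single example from that comment; the real Lua rules are not shown in this log. The ATS log line later in the conversation shows `%2F` left encoded on a REST path, so the production rules presumably differ by path.

```python
import re

# Assumption: characters that are decoded when they arrive percent-encoded.
DECODE_WHEN_ENCODED = set(":/()")
# Assumption: characters that are percent-encoded when they arrive raw.
ENCODE_WHEN_RAW = set("[]")

_PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path: str) -> str:
    """Hypothetical normalizer modelled on the normalize-path.lua comment."""

    def _maybe_decode(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Decode only the escapes in the allow-list; leave the rest untouched.
        return char if char in DECODE_WHEN_ENCODED else match.group(0)

    decoded = _PERCENT_ESCAPE.sub(_maybe_decode, path)
    # Encode raw characters from the second set as uppercase percent-escapes.
    return "".join(
        f"%{ord(c):02X}" if c in ENCODE_WHEN_RAW else c for c in decoded
    )


example_in = "/wiki/User:Ema%2fProfiling_Python%28Now you know[dude]"
example_out = normalize_path(example_in)
print(example_out)
```

With these assumed sets, the input from the Lua comment maps to the output given in the same comment. Whether `:` belongs in the decode set for PURGE requests on gateway-routed paths is exactly what the discussion is trying to establish.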
[10:37:35] let me track the whole request flow via ATS
[10:37:43] at the same time but the response from the service is up-to-date
[10:38:04] `curl -k "https://mobileapps.svc.eqiad.wmnet:4102/test.wikipedia.org/v1/page/mobile-html/User%3AJGiannelos_(WMF)%2Ftest-pcs-rollout"`
[10:38:14] not via rest-gateway
[10:38:34] nemo-yiannis: URI Hostname could have an impact there?
[10:38:47] I don't think so
[10:38:48] you can use --connect-to
[10:39:12] hnowlan: Is there any way I can send the same request but on rest-gateway level?
[10:40:00] nemo-yiannis: yep, curl 'https://rest-gateway.discovery.wmnet:4113/test.wikipedia.org/v1/page/mobile-html/User%3AJGiannelos_(WMF)%2Ftest-pcs-rollout'
[10:40:31] it also renders the latest
[10:41:14] Date:2025-04-29 Time:10:40:51 ConnAttempts:0 ConnReuse:0 TTFetchHeaders:271 ClientTTFB:271 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:271 TotalPluginTime:0 ActivePluginTime:0 TotalTime:271 OriginServer:rest-gateway.discovery.wmnet OriginServerTime:271 CacheResultCode:TCP_MISS CacheWriteResult:FIN ReqMethod:GET RespStatus:200 OriginStatus:200
[10:41:14] ReqURL:http://test.wikipedia.org/api/rest_v1/page/mobile-html/User:JGiannelos_(WMF)%2Ftest-pcs-rollout ReqHeader:User-Agent:curl/7.88.1 ReqHeader:Host:test.wikipedia.org ReqHeader:X-Client-IP:81.39.0.137 ReqHeader:Cookie: BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=1209600, max-age=0 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp6010 miss RespHeader:Backend-Timing:-
[10:41:23] that's a cache miss logged by ATS as well after the PURGE
[10:42:11] so ATS is actually re-fetching the content
[10:44:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs4009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:50] ha
[10:45:27] This returns the latest revision: `curl
"https://rest-gateway.discovery.wmnet:4113/test.wikipedia.org/v1/page/mobile-html/User%3AJGiannelos_(WMF)%2Ftest-pcs-rollout"` [10:45:44] but User:JGiannelos offers a a stale version? [10:45:51] curl "https://rest-gateway.discovery.wmnet:4113/test.wikipedia.org/v1/page/mobile-html/User:JGiannelos_(WMF)%2Ftest-pcs-rollout" [10:45:54] stale [10:46:06] .) [10:46:07] :) [10:47:58] i am confused :) [10:49:01] 06Traffic, 06Content-Transform-Team, 06serviceops: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10775884 (10Vgutierrez) ATS also shows how it's performing the request to the origin server after a PURGE: ` Date:2025-04-29 Time:10:40:51 ConnA... [10:49:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs4009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:47] nemo-yiannis: origin server is definitely out of my scope :) [10:50:02] hnowlan: any ideas ? [10:54:30] I need to check if RB did some extra normalization [10:56:50] nemo-yiannis: mobileapps also returns stale content directly when using ":" [10:57:38] possibly some kind of cache key issue? 
[10:58:50] looking at rb
[10:59:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs4008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs4009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:31:30] FIRING: [4x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum2001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:31:47] ^ yes, restarts in progress
[12:36:30] RESOLVED: [4x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:36:45] FIRING: [4x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:41:30] RESOLVED: [4x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:46:30] FIRING: [4x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:51:30]
RESOLVED: [3x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:58:25] FIRING: SystemdUnitFailed: anycast-healthchecker.service on durum3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:36] expected ^
[13:03:25] RESOLVED: SystemdUnitFailed: anycast-healthchecker.service on durum3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:18] I'm wondering why this keeps firing if in theory spicerack takes care of silencing both icinga & alertmanager
[13:22:03] should also be true for example for the anycast-hc alerts but I have always presumed that there is a race condition somewhere in when the alert is detected and fired, because otherwise it should happen in all cases and it doesn't?
[13:24:25] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:19] 2025-04-29 13:08:11,222 vgutierrez 235109 [INFO] Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: lvs6002
[13:28:30] alert got triggered 3 minutes later
[13:28:39] sorry...
6 minutes later
[13:36:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:55] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:59:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs3009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:38] yeah we should look into this, given how frequently it is firing
[14:03:58] Traffic, Data-Engineering, DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10776594 (Fabfur) @JAllemandou the change has been deployed in production, now all haproxykafka instances on cache hosts are sending the `termination_state` field...
[14:04:15] Traffic, Data-Engineering, DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10776600 (Fabfur)
[14:05:55] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs3009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:56] sukhe: it's firing for every single depool
[14:17:55] FIRING: [2x] SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs3008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:55] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs3008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:15] Traffic: wmfuniq-keygen: Install to /usr/bin, not /usr/sbin - https://phabricator.wikimedia.org/T392937 (BCornwall) NEW
[17:13:36] Traffic: wmfuniq-keygen: Install to /usr/bin, not /usr/sbin - https://phabricator.wikimedia.org/T392937#10777672 (BCornwall) Open→In progress p:Triage→Low
[17:16:08] Traffic: wmfuniq-keygen: Install to /usr/bin, not /usr/sbin - https://phabricator.wikimedia.org/T392937#10777680 (Dzahn) Or should it be /usr/local/bin/ because it's our own software that we install? https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s09.html
[17:18:08] Traffic: wmfuniq-keygen: Install to /usr/bin, not /usr/sbin - https://phabricator.wikimedia.org/T392937#10777683 (BCornwall) Considering it's a proper debian package, I think /usr/bin is more appropriate IMO.
[17:20:48] Traffic, Patch-For-Review: wmfuniq-keygen: Install to /usr/bin, not /usr/sbin - https://phabricator.wikimedia.org/T392937#10777690 (Dzahn) Ah, I see!. Yea, whatever the definition of "locally installed" is then. Not a strong opinion either way!
[17:54:17] Traffic, Data-Engineering-Radar, Data-Platform-SRE, MediaWiki-Platform-Team (Radar), Patch-For-Review: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810#10777852 (Ahoelzl)
[20:51:21] Would someone be available tomorrow to work on tearing down wdqs-internal lvs? (cc brett)
[21:00:20] ryankemper: Sure, we can do that. When's a good time?
[21:02:48] brett: starting either at 11am or 2pm pst works for me, what's your preference?
[21:03:23] Sorry to be pedantic but I assume you meant pdt? 11 am works for me
[21:13:14] yes :) I can never remember which one we're in
[21:13:20] cool let's plan on 11am then. i'll make a calendar event
[21:26:33] brett: okay, calendar event up. I added links to the 2 patch chains (dns repo and puppet repo) in the description
[21:27:23] Thank you!
[21:58:42] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10778576 (BCornwall)