[00:59:30] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11292126 (Krinkle) Misc notes from the remaining audits across Wiktionary, Wikisource, Commons, Wikidata and Wik...
[00:59:41] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11292127 (Krinkle) Open→Resolved
[01:01:06] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11292134 (Krinkle)
[04:13:06] Traffic, observability: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826 (tstarling) NEW
[04:17:20] Traffic, SRE, MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11292290 (tstarling) {T407826} may be related.
[04:19:47] Traffic, observability: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11292292 (tstarling)
[04:55:25] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11292306 (RolandUnger) @Kinkle, es.wikivoyage.org: The code of https://es.wikivoyage.org/wiki/MediaWiki:Mobile.j...
[05:20:45] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11292317 (RolandUnger) @Krinkle, de.wikivoyage.org: The "Karte" button is created by a Javascript and consists o...
[07:02:43] FIRING: HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3070&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:07:43] FIRING: [21x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:17:43] RESOLVED: [21x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[07:27:52] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833 (LSobanski) NEW
[07:58:56] Traffic, bot-traffic-requests: Global block exception for AddDesc app - https://phabricator.wikimedia.org/T407706#11292633 (Joe) Hi, I'm not sure this task is tagged with the right tags. These addresses are part of google cloud compute, which we don not blanket-block at the CDN. We rather block a few ve...
[08:43:02] Traffic, bot-traffic-requests: Global block exception for AddDesc app - https://phabricator.wikimedia.org/T407706#11292764 (Joe) For instance, I see some of the IP ranges from GCP are part of a [[ https://meta.wikimedia.org/wiki/Special:Log?type=gblblock&page=35.203.128.0/18 | global on-wiki block ]], so...
[08:48:20] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11292788 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d70417af-8325-49e7-a880-7a0cd37bd2d2) set by cmo...
[09:16:38] Traffic, observability: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11292897 (Vgutierrez) This is pretty weird, according to [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562) for UUID v4, the third block should always start with a `4`. A quick check suggests tha...
[09:58:32] Traffic, observability: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293094 (tstarling) If I send an X-Request-Id header to mw-web.svc, I get it back intact: ` [0953][tstarling@deploy2002:~]$ curl -s -I --connect-to en.wikipedia.org:443:mw-web.svc.codfw.wmnet:44...
[10:01:10] o/ just a heads-up I have a low-impact test2wiki gateway-check rerouting change that I'd like to merge, won't need to stop puppet or anything
[10:01:13] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189936
[10:01:36] hnowlan: cool
[10:01:46] thanks
[10:02:30] hnowlan: BTW maybe you have some useful insights regarding https://phabricator.wikimedia.org/T407826#11292897
[10:04:14] it looks like mwdebug is tampering with `x-request-id` values
[10:04:19] wtf, that is bizarre
[10:07:37] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293119 (hnowlan)
[10:25:24] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293148 (Clement_Goubert) This is envoy because of tracing: `lang=shell-session # Request apache directly in a mw-debug pod namespace root@wikikube-worker1064:/home/cgoubert# curl -k -v -s -I -H 'Ho...
[10:47:37] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293214 (Clement_Goubert) We need to set ` request_id_extension: typed_config: "@type": type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig pack_trace_reason: fals...
[10:51:50] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293222 (Clement_Goubert) Open→In progress p:Triage→High a:Clement_Goubert
[10:56:52] claime: thx <3
[10:57:06] vgutierrez: yw <3
[11:13:14] dear traffic, I will merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197230 to fix a test, and restart pybal. Any concerns?
[11:15:40] no concerns
[11:17:44] cheers
[11:42:05] !log restarted pybal on lvs1020*,lvs2014*
[11:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:36] !log restarted pybal on lvs1019*
[12:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:28] !log restarted pybal on lvs2014*
[12:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:27] Traffic, serviceops, Patch-For-Review: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293440 (Clement_Goubert) ` cgoubert@deploy2002:/srv/deployment-charts/charts/mediawiki$ curl -s -I --connect-to en.wikipedia.org:443:mwdebug.svc.eqiad.wmnet:4444 -H'x-request-...
[12:40:29] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11293479 (CDanis) >>! In T407826#11293440, @Clement_Goubert wrote: > Fixed in debug, now to redeploy every service using the mesh... The Mediawiki deployments alone are probably 95%+ of the actual d...
[13:23:26] Traffic, Observability-Alerting: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11293621 (Vgutierrez) p:Low→Medium downtiming for prometheus alertmanager seems broken to me. What we are seeing here looks like this: * The me...
[13:35:01] Traffic, Observability-Alerting, Spicerack, SRE-tools: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11293662 (ssingh) > It looks like spicerack should check that alerts for the downtimed host have been resolved (not in fi...
[13:53:43] FIRING: [9x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[13:58:43] FIRING: [16x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:06:39] Traffic, Infrastructure-Foundations, Observability-Alerting, Spicerack, SRE-tools: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11293886 (Volans) For some related historical context on the lack of parity between the I...
[14:08:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-trafficserver-backend-exporter.service on cp4052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:43] RESOLVED: [16x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:55:47] Traffic, bot-traffic-requests: Global block exception for AddDesc app - https://phabricator.wikimedia.org/T407706#11294097 (Luky001) **What error status code and message you get back?** ` WARNING: API error permissiondenied: You do not have the permissions needed to carry out this action. ` This probabl...
[15:33:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:07] Traffic, Experimentation Lab: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11294542 (JVanderhoop-WMF) p:Triage→High
[16:09:06] Traffic, Experimentation Lab: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11294554 (JVanderhoop-WMF)
[16:13:25] FIRING: [2x] SystemdUnitFailed: haproxykafka.service on cp1111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:38:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp1114:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:43:06] Traffic, SRE, FY2025-26 WE3.3 Engaging core audiences, Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11295222 (SToyofuku-WMF)
[17:43:43] ^ all cp-eqiad alerts that happen now are actual alerts.
[17:47:25] ack
[18:38:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp3068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:42] ^ all good
[18:38:58] so no email alerts today, just delayed IRC ones
[19:18:21] Domains, Traffic: Transfer wikipedia.pt domain to community - https://phabricator.wikimedia.org/T404913#11295689 (ssingh) Hi @CRoslof: This is another ticket that we would like to take up and will need your help with so that we can reflect it in downstream services as well. Let me know if I should create...
[19:18:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp3069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:18:34] ^ nothing to worry
[19:30:54] anyone have a minute to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197685 ?
[19:31:27] only affects the one host in the pcc line
[19:37:34] * sukhe peeks
[19:38:23] only set on cp7008
[19:38:29] makes it easy :)
[19:42:31] cdanis: +1ed with the caveat that I am assuming you have tested the particular stanza already
[19:42:47] if you want someone to review that bit as well then I will remove the +1
[20:03:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp3070:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:29] sukhe: no that's alright, there's an equivalent block in tls_terminator already enabled since this morning
[20:23:31] thanks!
[20:27:36] wish puppet-agent didn't take so long
[20:43:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp3071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp3073 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[22:20:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp3073:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:58:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp3073 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[23:00:21] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11296564 (RLazarus) This is deployed to all services.