[00:03:11] <wikibugs>	 (PS2) RLazarus: deployment_server: Add a script for mass-deploying helmfile services [puppet] - https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211)
[00:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:04:38] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply
[00:04:46] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[00:05:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:08:05] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1188478
[00:08:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1188478 (owner: TrainBranchBot)
[00:08:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:08:50] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply
[00:08:59] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[00:09:14] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[00:11:50] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[00:13:23] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply
[00:13:35] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[00:13:53] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:13:59] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:17:15] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:19:46] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[00:19:50] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[00:20:31] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply
[00:20:40] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[00:20:52] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[00:21:00] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[00:21:11] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply
[00:21:35] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[00:22:06] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[00:22:10] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[00:22:19] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[00:22:27] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[00:22:36] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply
[00:22:52] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[00:23:08] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply
[00:23:13] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[00:23:31] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[00:23:35] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[00:23:53] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[00:23:57] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[00:24:06] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[00:24:10] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[00:24:21] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[00:24:51] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[00:28:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1188478 (owner: TrainBranchBot)
[00:28:23] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply
[00:28:31] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[00:29:20] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[00:29:35] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[00:29:45] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply
[00:29:58] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply
[00:42:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:52:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:57:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:08:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380)
[01:08:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: TrainBranchBot)
[01:23:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: TrainBranchBot)
[01:24:07] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:33:59] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:44:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183385 (phaultfinder)
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0200)
[02:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:13:59] <jinxer-wm>	 FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:22:08] <wikibugs>	 (PS1) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:22:38] <wikibugs>	 (CR) CI reject: [V:-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491 (owner: Krinkle)
[02:23:31] <wikibugs>	 (PS2) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:23:57] <wikibugs>	 (CR) CI reject: [V:-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491 (owner: Krinkle)
[02:24:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183413 (phaultfinder)
[02:26:01] <wikibugs>	 (PS3) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:28:54] <wikibugs>	 (PS4) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:29:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:06] <wikibugs>	 (PS5) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:31:04] <wikibugs>	 (PS6) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:31:23] <wikibugs>	 (PS7) Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - https://gerrit.wikimedia.org/r/1188491
[02:32:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11183419 (Papaul) @cmooney we have the spare PEM on site. I need to get on a call with Juniper to troubleshoot this. Do you think Thursd...
[02:34:08] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:34:08] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:35:17] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183420 (Papaul)
[02:36:29] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183421 (Papaul) Open→Resolved a:Papaul The BIO reader is installed now and working, so closing this task
[02:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:15] <wikibugs>	 (PS1) Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592)
[02:39:28] <wikibugs>	 (PS2) Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592)
[02:43:58] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:43:58] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:50:10] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183429 (phaultfinder)
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0300)
[03:23:59] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:24:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183455 (phaultfinder)
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0400)
[04:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:04:18] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.16 (duration: 04m 08s)
[04:05:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:13:59] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:24:55] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183464 (phaultfinder)
[04:48:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:53:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:02:43] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:53] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:58] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:08] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:07:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:08:59] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:24:07] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:25:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:27:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:30:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:32:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:32:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:33:59] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:00] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:37:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:38:59] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:39] <wikibugs>	 (PS1) Huei Tan: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420)
[05:47:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: Huei Tan)
[05:47:20] <wikibugs>	 (Restored) Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 (owner: Huei Tan)
[05:47:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 (owner: Huei Tan)
[05:48:54] <hueitan>	 Hi, I have 2 patches for later backport. Kartik is not available, can someone help with the deployment?
[05:54:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183555 (phaultfinder)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600).
[06:13:59] <jinxer-wm>	 FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:29:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:14] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM, but this changes an existing sudo rule, so needs SRE IF meeting approval" [puppet] - https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) (owner: CDanis)
[06:52:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:59:56] <awight>	 I can deploy these backports.
[07:00:00] <hueitan>	 o/
[07:00:03] <hueitan>	 thanks
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0700). nyaa~
[07:00:04] <jouncebot>	 hueitan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: Huei Tan)
[07:03:14] <wikibugs>	 (Merged) jenkins-bot: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: Huei Tan)
[07:03:40] <logmsgbot>	 !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]]
[07:03:45] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[07:07:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:09:20] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:09:40] <logmsgbot>	 !log awight@deploy1003 awight, hueitan: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:09:45] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[07:11:39] <awight>	 hueitan: Please check on mwdebug
[07:11:48] <hashar>	 awight: thanks for the deployments! :]
[07:12:33] <hueitan>	 awight: checked, I see it live now on mwdebug
[07:12:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:12:43] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:12:45] <awight>	 hashar: My pleasure—Spider Pig has not let me down
[07:12:51] <awight>	 hueitan: ty
[07:12:58] <logmsgbot>	 !log awight@deploy1003 awight, hueitan: Continuing with sync
[07:13:05] <hashar>	 awight: yeah it is quite rad!  Maybe one day we will have an equivalent to run Quibble from a web interface! :b
[07:13:13] <jynus>	 the bacula alert will get fixed soon
[07:15:03] <wikibugs>	 (PS2) Slyngshede: Bump CAS container to 7.2.2 [software/bitu] - https://gerrit.wikimedia.org/r/1151178
[07:15:09] <wikibugs>	 (CR) Slyngshede: [C:+2] Bump CAS container to 7.2.2 [software/bitu] - https://gerrit.wikimedia.org/r/1151178 (owner: Slyngshede)
[07:18:15] <wikibugs>	 (Merged) jenkins-bot: Bump CAS container to 7.2.2 [software/bitu] - https://gerrit.wikimedia.org/r/1151178 (owner: Slyngshede)
[07:18:16] <logmsgbot>	 !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] (duration: 14m 35s)
[07:18:20] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[07:18:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 (owner: Huei Tan)
[07:18:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Arelion (2001:2035:0:a9a::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:18:45] <awight>	 Finished.  On to the second patch...
[07:18:59] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:19:21] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:04] <wikibugs>	 (03Merged) 10jenkins-bot: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan)
[07:21:21] <logmsgbot>	 !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]]
[07:21:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:59] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mailman: add a local disk cache [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[07:25:47] <awight>	 hueitan: is this one testable?
[07:25:55] <hueitan>	 let me check
[07:26:09] <awight>	 maybe I kafkacat or...
[07:26:43] <awight>	 hueitan: sorry, it's not quite ready to test yet
[07:27:01] <awight>	 I was confusingly asking ahead of time
[07:27:40] <logmsgbot>	 !log awight@deploy1003 hueitan, awight: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:27:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:27:43] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:27:58] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:02] <icinga-wm>	 PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:28:11] <hueitan>	 awight i see it live now
[07:28:44] <logmsgbot>	 !log awight@deploy1003 hueitan, awight: Continuing with sync
[07:28:47] <awight>	 hueitan: ack
[07:28:55] <hueitan>	 Thank you, all good
[07:30:02] <icinga-wm>	 RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:30:53] <wikibugs>	 (03PS1) 10Arnaudb: Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696
[07:31:34] <wikibugs>	 (03CR) 10Jelto: [C:03+1] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb)
[07:32:33] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb)
[07:32:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:32:48] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:33:02] <icinga-wm>	 PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:36] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:34:31] <logmsgbot>	 !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] (duration: 13m 10s)
[07:35:02] <icinga-wm>	 RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:35:20] <hueitan>	 💯
[07:35:38] <awight>	 !log UTC morning deployments finished
[07:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:48] <wikibugs>	 (03PS1) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708
[07:35:49] <awight>	 hueitan: Thanks for the help :-)
[07:36:06] <hueitan>	 awight thank you
[07:37:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:39:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:42:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:44:06] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:46:51] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur)
[07:47:43] <jinxer-wm>	 FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:47:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:38] <wikibugs>	 (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162)
[07:52:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:48] <jinxer-wm>	 FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:53] <jinxer-wm>	 FIRING: [15x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:53:36] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:57:43] <jinxer-wm>	 FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:57:48] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:02:43] <jinxer-wm>	 RESOLVED: [8x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:13:59] <jinxer-wm>	 FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:22:04] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - Certificate kafka-test1008.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:22:22] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - Certificate kafka-test1006.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:22:22] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - Certificate kafka-test1010.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:23:22] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - Certificate kafka-test1007.eqiad.wmnet valid until 2025-09-23 08:23:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:24:06] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2006 is CRITICAL: SSL CRITICAL - Certificate kafka-main2006.codfw.wmnet valid until 2025-09-23 08:24:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:24:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol)
[08:25:04] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - Certificate kafka-test1009.eqiad.wmnet valid until 2025-09-23 08:25:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:26:59] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol)
[08:27:06] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2007 is CRITICAL: SSL CRITICAL - Certificate kafka-main2007.codfw.wmnet valid until 2025-09-23 08:27:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:28:04] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2009 is CRITICAL: SSL CRITICAL - Certificate kafka-main2009.codfw.wmnet valid until 2025-09-23 08:28:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:31:01] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Apply replica role to maps1012-1014 [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[08:31:06] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2008 is CRITICAL: SSL CRITICAL - Certificate kafka-main2008.codfw.wmnet valid until 2025-09-23 08:31:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:32:03] <wikibugs>	 (03CR) 10Elukey: [C:03+2] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine)
[08:34:53] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681 (10hueitan) 03NEW
[08:35:54] <wikibugs>	 (03PS1) 10Gergő Tisza: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748)
[08:36:38] <wikibugs>	 (03PS1) 10Gergő Tisza: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200)
[08:37:06] <wikibugs>	 (03PS1) 10Gergő Tisza: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200)
[08:37:24] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2010 is CRITICAL: SSL CRITICAL - Certificate kafka-main2010.codfw.wmnet valid until 2025-09-23 08:37:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[08:38:10] <wikibugs>	 (03PS1) 10Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200)
[08:38:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza)
[08:38:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[08:38:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[08:39:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[08:41:33] <wikibugs>	 (03Merged) 10jenkins-bot: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine)
[08:42:02] <wikibugs>	 (03PS1) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691)
[08:45:12] <wikibugs>	 (03PS2) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631)
[08:45:17] <wikibugs>	 (03CR) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza)
[08:45:38] <wikibugs>	 (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721
[08:45:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza)
[08:46:32] <wikibugs>	 (03PS2) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691)
[08:46:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[08:47:01] <wikibugs>	 (03PS3) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631)
[08:47:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza)
[08:48:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:53:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:53:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:57:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721 (owner: 10Elukey)
[08:58:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:58:49] <wikibugs>	 (03PS3) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251)
[08:59:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#11183855 (10cmooney) >>! In T380050#10654652, @BCornwall wrote: > Re: https://gerrit.wikimedia.org/r/c/operations/dns/+/1091711/comments/5e6962e8_b88980ce - Do the IPs need to be deleted from netbox?  Y...
[09:00:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:04-1] "@kosta, please point to where we define the version, so we can add it in the comments and move forward, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli)
[09:00:06] <wikibugs>	 (03PS1) 10Elukey: Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728
[09:00:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli)
[09:01:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[09:03:01] <wikibugs>	 (03PS2) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891)
[09:03:39] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[09:04:43] <wikibugs>	 (03PS4) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251)
[09:05:08] <wikibugs>	 (03CR) 10Effie Mouzeli: "variable is $wgHCaptchaApiUrl" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli)
[09:05:24] <wikibugs>	 (03PS3) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891)
[09:05:40] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728 (owner: 10Elukey)
[09:06:14] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[09:06:28] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[09:06:34] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731
[09:08:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[09:08:18] <wikibugs>	 (03PS2) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828
[09:08:39] <jinxer-wm>	 RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:08:52] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert^2 "mailman: add a local disk cache" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[09:09:33] <wikibugs>	 (03CR) 10Jelto: [C:03+1] Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb)
[09:11:23] <wikibugs>	 (03CR) 10Elukey: [C:03+1] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris)
[09:12:03] <wikibugs>	 (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (owner: 10Effie Mouzeli)
[09:12:13] <wikibugs>	 (03PS1) 10Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732
[09:12:57] <wikibugs>	 (03PS3) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416)
[09:12:57] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732 (owner: 10Arnaudb)
[09:13:27] <wikibugs>	 (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli)
[09:23:50] <Dreamy_Jazz>	 jouncebot: nowandnext
[09:23:50] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 36 minute(s)
[09:23:50] <jouncebot>	 In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000)
[09:24:07] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:24:31] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli)
[09:24:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:24:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183943 (10phaultfinder)
[09:25:04] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:25:22] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:27:05] <elukey>	 !log uploaded spicerack_11.7.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia
[09:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] "we already do this in scap IIRC." [puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris)
[09:30:40] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM. Tested in local environment." [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur)
[09:31:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz)
[09:31:30] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad
[09:31:59] <wikibugs>	 (03Merged) 10jenkins-bot: Document that test2wiki has suggested investigations DB tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz)
[09:32:15] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]]
[09:32:19] <stashbot>	 T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594
[09:32:54] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[09:34:00] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:34:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:36:27] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie
[09:38:19] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:puppetserver::volatile avoid loading Spur data on certain host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede)
[09:38:20] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:38:25] <stashbot>	 T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594
[09:38:42] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[09:39:50] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2026-08-23 08:34:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[09:41:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:04] <wikibugs>	 (PS2) Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200)
[09:44:09] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] (duration: 11m 54s)
[09:44:14] <stashbot>	 T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594
[09:45:37] <wikibugs>	 (CR) Fabfur: "Thanks, another test is always helpful!" [puppet] - https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur)
[09:46:02] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance
[09:46:09] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83353 and previous config saved to /var/cache/conftool/dbconfig/20250916-094609-ladsgroup.json
[09:46:14] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[09:46:25] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2026-08-23 08:21:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[09:47:58] <fabfur>	 !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188379 (T401383)
[09:48:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:02] <stashbot>	 T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383
[09:48:39] <wikibugs>	 (PS2) Federico Ceratto: es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859)
[09:48:39] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance
[09:48:47] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83354 and previous config saved to /var/cache/conftool/dbconfig/20250916-094846-ladsgroup.json
[09:49:48] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[09:50:30] <wikibugs>	 SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688 (SLyngshede-WMF) NEW
[09:50:48] <wikibugs>	 SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184085 (SLyngshede-WMF) p:Triage→High
[09:52:08] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm
[09:52:15] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy: use utf8ps converter on received headers [puppet] - https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur)
[09:52:25] <wikibugs>	 SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184097 (SLyngshede-WMF)
[09:52:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[09:52:59] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[09:53:57] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2026-08-23 08:23:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[09:54:21] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad
[09:56:28] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11184115 (Jelto) My proposal to move forward is to sync the files from object storage to a local folder on the GitLab host. Ideal...
[09:57:02] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[09:58:54] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[09:59:58] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] P:hcaptcha: unset more headers [puppet] - https://gerrit.wikimedia.org/r/1188367 (https://phabricator.wikimedia.org/T403416) (owner: Effie Mouzeli)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000)
[10:00:04] <jouncebot>	 claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:01:01] <claime>	 jynus: volans: heads up, I'm going to start deploying a change to multi-dc.lua on cp nodes https://gerrit.wikimedia.org/r/c/1182815/
[10:01:19] <claime>	 cc fabfur ^
[10:01:53] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83355 and previous config saved to /var/cache/conftool/dbconfig/20250916-100152-ladsgroup.json
[10:01:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:01:58] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:02:16] <wikibugs>	 (PS1) Brouberol: deployment_server: allow different namespaces to be deployed within a same cluster group [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068)
[10:02:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:03:09] <wikibugs>	 (PS2) Brouberol: deployment_server: allow different namespaces to be deployed within a group [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068)
[10:03:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:04:16] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[10:04:17] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] P:hcaptcha: add temporary redirect [puppet] - https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: Effie Mouzeli)
[10:06:13] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[10:06:40] <wikibugs>	 SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11184198 (Joe) >>! In T400119#11115049, @TheDJ wrote: > Yeah getting the swagger spec via `curl https://api.wikimedia.org/core/v1/wikipedia/en/search/pag...
[10:08:30] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm
[10:09:55] <wikibugs>	 (PS1) Effie Mouzeli: P:hcaptcha: typo (oops) [puppet] - https://gerrit.wikimedia.org/r/1188737
[10:10:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:10:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:10:42] <fabfur>	 !log tests look good, enabling puppet on A:cp (T401383)
[10:10:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:46] <stashbot>	 T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383
[10:11:17] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] P:hcaptcha: typo (oops) [puppet] - https://gerrit.wikimedia.org/r/1188737 (owner: Effie Mouzeli)
[10:11:44] <effie>	 ^^ that is me 
[10:11:51] <effie>	 it is ok 
[10:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:14:00] <jinxer-wm>	 FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:14:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83356 and previous config saved to /var/cache/conftool/dbconfig/20250916-101420-ladsgroup.json
[10:14:25] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:15:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: Anzx)
[10:15:04] <wikibugs>	 (PS3) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251)
[10:15:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:17:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83357 and previous config saved to /var/cache/conftool/dbconfig/20250916-101700-ladsgroup.json
[10:17:04] <claime>	 fabfur: ah, you're deploying things on cp nodes?
[10:17:16] <claime>	 Should I wait a little for https://gerrit.wikimedia.org/r/c/1182815/ ?
[10:17:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:46] <wikibugs>	 (CR) Muehlenhoff: "That seems fine, only the ipblocks/abuse hierarchy is sourced by the ferm requestctl support and those rules are mostly made in reacti" [puppet] - https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm)
[10:18:00] <fabfur>	 claime: I reenabled puppet on A:cp, no problem on my side to proceed with other changes
[10:18:17] <fabfur>	 but thanks for noticing!
[10:18:44] <claime>	 fabfur: ack, but since I'll have to re-disable puppet on A:cp, I may still need to wait
[10:18:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:18:53] <claime>	 otherwise your change may not deploy in isolation
[10:19:32] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie
[10:23:49] <fabfur>	 it's ok for me
[10:24:02] <jynus>	 so, the ongoing recovery of restarted is claime and the recovery of urldownloader was effie, right?
[10:24:11] <claime>	 No, I've touched nothing yet
[10:24:13] <wikibugs>	 (PS1) Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859)
[10:24:14] <jynus>	 ah
[10:25:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - https://gerrit.wikimedia.org/r/1188731 (owner: Alexandros Kosiaris)
[10:25:33] <claime>	 anything urldownloader is effie rn though :)
[10:27:38] <claime>	 !log sudo cumin 'A:cp' "disable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'"
[10:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:43] <stashbot>	 T402412: Route test2wiki rest.php APIs through rest-gateway  - https://phabricator.wikimedia.org/T402412
[10:28:11] <volans>	 riposoqualita@gmail.com
[10:28:16] <volans>	 oops, bad paste
[10:28:25] <volans>	 👀
[10:29:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:29] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83358 and previous config saved to /var/cache/conftool/dbconfig/20250916-102928-ladsgroup.json
[10:29:37] <wikibugs>	 (CR) Clément Goubert: [C:+2] multi-dc: Dynamic rewrite to -ro destinations [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert)
[10:29:58] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184292 (elukey) Resolved→Open >>! In T394357#11162710, @MatthewVernon wrote: > Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has...
[10:30:27] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11184299 (elukey) The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.
[10:31:01] <claime>	 !log Enabling puppet for testing on cp6011 and cp2041 - T402412 - T400131
[10:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:09] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[10:31:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[10:32:08] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83359 and previous config saved to /var/cache/conftool/dbconfig/20250916-103208-ladsgroup.json
[10:33:23] <effie>	 jynus: yes we are alright 
[10:33:39] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11184312 (jcrespo) If that unblocks you, I am ok with that- sadly because other priorities keep entering data persistence with un...
[10:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:37:03] <logmsgbot>	 elukey@cumin1003 reimage (PID 2807097) is awaiting input
[10:42:29] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm
[10:43:13] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:44:37] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83360 and previous config saved to /var/cache/conftool/dbconfig/20250916-104436-ladsgroup.json
[10:45:33] <wikibugs>	 (CR) Ladsgroup: [C:+1] es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:46:09] <claime>	 My tests look good but something looks wrong-ish with the rest-gateway
[10:46:26] <claime>	 it's serving 30 5xx per second since a bit past 0955
[10:47:09] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:47:16] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83361 and previous config saved to /var/cache/conftool/dbconfig/20250916-104715-ladsgroup.json
[10:47:20] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:49:15] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:50:02] <claime>	 Ugh proton again
[10:50:24] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11184382 (cmooney) Hey @papaul yeah Thursday will be fine thanks.
[10:52:16] <claime>	 I'm moving forward despite this, I'll diagnose it in parallel, it's unrelated to the change
[10:52:31] <claime>	 !log sudo cumin 'A:cp' "enable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'"
[10:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:36] <stashbot>	 T402412: Route test2wiki rest.php APIs through rest-gateway  - https://phabricator.wikimedia.org/T402412
[10:53:09] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:56:48] <wikibugs>	 (CR) Federico Ceratto: [C:+2] es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:57:21] <logmsgbot>	 elukey@cumin1003 interactive (PID 2810037) is awaiting input
[10:57:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697 (mszwarc) NEW
[10:59:44] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83363 and previous config saved to /var/cache/conftool/dbconfig/20250916-105944-ladsgroup.json
[10:59:49] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[11:00:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11184443 (OKryva-WMF) As Marcin's Engineering Manager, approve.
[11:18:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002"
[11:18:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002"
[11:19:00] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:21:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:04] <wikibugs>	 (PS4) Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433)
[11:25:05] <wikibugs>	 (CR) CI reject: [V:-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[11:27:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:31:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:31:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:33:22] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me, but I would like to make sure that others are also able to review, for visibility." [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: Brouberol)
[11:39:23] <wikibugs>	 (PS1) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:39:31] <wikibugs>	 (PS1) Clément Goubert: Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751
[11:40:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org
[11:41:42] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:42] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.385 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:47] <wikibugs>	 (PS2) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:42:09] <wikibugs>	 (CR) Hnowlan: [C:+1] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751 (owner: Clément Goubert)
[11:43:08] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:44:19] <wikibugs>	 (CR) CI reject: [V:-1] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:45:48] <wikibugs>	 (PS3) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:47:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org
[11:48:08] <wikibugs>	 (CR) CI reject: [V:-1] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:48:59] <jinxer-wm>	 RESOLVED: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:33] <wikibugs>	 (PS4) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:54:51] <wikibugs>	 (CR) Clément Goubert: [C:+2] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751 (owner: Clément Goubert)
[11:54:54] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11184626 (phaultfinder)
[11:58:51] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184633 (elukey) There is something definitely off, I just tested the following and everything hangs:  ` -> reset /system1/pwrmgtsvc1 /system1/pwrmgtsvc1 `  I am trying to set...
[11:59:40] <wikibugs>	 (PS1) Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200)
[12:00:47] <wikibugs>	 (Abandoned) Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[12:03:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:05:46] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11184653 (MoritzMuehlenhoff) There was a small issue with install3004, it lacked the global ipv6 address, which caused failing ipv6 probes to Squid. The rele...
[12:08:40] <wikibugs>	 (PS1) Esanders: Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687)
[12:08:43] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83364 and previous config saved to /var/cache/conftool/dbconfig/20250916-120842-ladsgroup.json
[12:08:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:08:48] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:09:51] <wikibugs>	 (CR) CI reject: [V:-1] Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: Esanders)
[12:15:45] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1194 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83365 and previous config saved to /var/cache/conftool/dbconfig/20250916-121545-ladsgroup.json
[12:15:50] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:18:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184704 (cmooney) >>! In T404609#11181649, @RobH wrote: > @cmooney: What do you think is the best way to go about migrating these connections on upcoming C...
[12:18:20] <claime>	 !log depooling cp2041 - T402412
[12:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:25] <stashbot>	 T402412: Route test2wiki rest.php APIs through rest-gateway  - https://phabricator.wikimedia.org/T402412
[12:19:58] <wikibugs>	 (PS2) Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396)
[12:22:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184711 (cmooney) @RobH @Jclark-ctr there is also another way we could try to approach this so may as well mention it now before we start planning.  Rack-b...
[12:37:06] <wikibugs>	 (PS1) Huei Tan: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420)
[12:37:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: Huei Tan)
[12:40:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184793 (Jclark-ctr) @cmooney I’m flexible to try either way. Maybe a mix could work? We could start with roles that aren’t single points of failure and ar...
[12:46:38] <wikibugs>	 (PS1) Urbanecm: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563)
[12:46:46] <urbanecm>	 jouncebot: nowandnext
[12:46:46] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200)
[12:46:46] <jouncebot>	 In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300)
[12:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:49:14] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:49:38] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:50:58] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:53:22] <wikibugs>	 (03PS2) 10Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859)
[12:53:22] <wikibugs>	 (03PS1) 10Federico Ceratto: instances.yaml: add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859)
[12:56:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan)
[12:57:52] <wikibugs>	 (03PS1) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[12:58:37] <hueitan>	 o/ i need someone help with my patch deployment.
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300).
[13:00:05] <jouncebot>	 joelyrookewmde, tgr, anzx, and hueitan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <hueitan>	 o/ i need someone help with my patch deployment.
[13:00:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede)
[13:00:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188374 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi)
[13:01:00] <tgr>	 I can deploy
[13:01:27] <anzx>	 o/
[13:01:29] <hueitan>	 tgr tq
[13:01:33] <Lucas_WMDE>	 o/
[13:01:42] <Lucas_WMDE>	 I’m in a meeting, thanks tgr for deploying :)
[13:02:02] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1253 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83366 and previous config saved to /var/cache/conftool/dbconfig/20250916-130201-ladsgroup.json
[13:02:06] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:03:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: 10Anzx)
[13:03:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE)
[13:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: Lift IP cap for workshop at University of Pretoria on 29-30 September [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: 10Anzx)
[13:05:04] <wikibugs>	 (03Merged) 10jenkins-bot: Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE)
[13:05:21] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]]
[13:05:26] <stashbot>	 T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218
[13:05:27] <stashbot>	 T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674
[13:06:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1191 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83367 and previous config saved to /var/cache/conftool/dbconfig/20250916-130618-ladsgroup.json
[13:09:35] <wikibugs>	 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11184909 (10AMJohnson) 05Open→03Resolved a:03AMJohnson @DSeyfert_WMF was able to fix this for us. Thank you, Dustin! Going ahead and closing out this...
[13:09:36] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1202 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83368 and previous config saved to /var/cache/conftool/dbconfig/20250916-130935-ladsgroup.json
[13:09:41] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:10:58] <logmsgbot>	 !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:04] <stashbot>	 T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218
[13:11:04] <stashbot>	 T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674
[13:11:25] <anzx>	 tgr: nothing to test on throttle 
[13:11:52] <tgr>	 joelyrookewmde: I assume you don't need to test either?
[13:11:53] <joelyrookewmde>	 @tgr sorry I missed the start of this deployment. Thanks for approving it !
[13:12:02] <joelyrookewmde>	 no all good for me
[13:12:10] <logmsgbot>	 !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Continuing with sync
[13:13:45] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83369 and previous config saved to /var/cache/conftool/dbconfig/20250916-131345-ladsgroup.json
[13:14:25] <tgr>	 hueitan: do you feel confident about your patch? if it's low-risk, I'll bundle it with the other backports
[13:14:39] <hueitan>	 yes, confident
[13:15:21] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol)
[13:17:35] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] (duration: 12m 14s)
[13:17:41] <stashbot>	 T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218
[13:17:42] <stashbot>	 T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674
[13:17:44] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan)
[13:19:45] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events (031 comment) [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan)
[13:20:15] <phuedx>	 tgr: Just reviewed it. It's low risk and will reduce event validation errors back to the baseline rate
[13:20:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza)
[13:20:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:20:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:20:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:20:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan)
[13:22:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede)
[13:22:25] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774
[13:22:37] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774 (owner: 10Clément Goubert)
[13:23:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] profile: clean up root-authorized-key.erb transition [puppet] - 10https://gerrit.wikimedia.org/r/1188374 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi)
[13:23:29] <kostajh>	 jouncebot: nowandnext
[13:23:29] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300)
[13:23:29] <jouncebot>	 In 0 hour(s) and 36 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1400)
[13:23:47] <wikibugs>	 (03PS2) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[13:24:07] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:25:20] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776
[13:25:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza)
[13:25:56] <wikibugs>	 (03Merged) 10jenkins-bot: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:26:02] <wikibugs>	 (03Merged) 10jenkins-bot: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:26:04] <wikibugs>	 (03Merged) 10jenkins-bot: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza)
[13:26:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede)
[13:27:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776 (owner: 10Andrew Bogott)
[13:29:22] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan)
[13:29:43] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events (T404420)]]
[13:29:51] <stashbot>	 T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748
[13:29:52] <stashbot>	 T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[13:29:52] <stashbot>	 T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667
[13:29:53] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[13:30:15] <wikibugs>	 (03PS4) 10Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251)
[13:30:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:31:51] <wikibugs>	 (03PS2) 10Majavah: hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689)
[13:31:51] <wikibugs>	 (03PS1) 10Majavah: hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510)
[13:32:51] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah)
[13:33:02] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah)
[13:33:21] <wikibugs>	 (03PS3) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[13:33:59] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:35:15] <claime>	 !log repooling cp2041, test inconclusive, rolled back - T402412
[13:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:21] <stashbot>	 T402412: Route test2wiki rest.php APIs through rest-gateway  - https://phabricator.wikimedia.org/T402412
[13:35:40] <logmsgbot>	 !log tgr@deploy1003 hueitan, tgr: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events (T404420)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:35:48] <stashbot>	 T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748
[13:35:49] <stashbot>	 T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[13:35:49] <stashbot>	 T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667
[13:35:50] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[13:36:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede)
[13:38:35] <hueitan>	 (y)
[13:40:54] <wikibugs>	 (03PS4) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[13:41:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1017.eqiad.wmnet with OS bookworm
[13:43:23] <logmsgbot>	 !log tgr@deploy1003 hueitan, tgr: Continuing with sync
[13:43:28] <wikibugs>	 (03PS1) 10Andrew Bogott: Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249)
[13:44:38] <wikibugs>	 (03PS3) 10Majavah: P:toolforge: remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665)
[13:44:39] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664)
[13:44:40] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784
[13:45:03] <wikibugs>	 (03PS1) 10Andrew Bogott: Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249)
[13:45:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott)
[13:48:27] <wikibugs>	 (03PS4) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592)
[13:48:33] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events (T404420)]] (duration: 18m 50s)
[13:48:41] <stashbot>	 T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748
[13:48:42] <stashbot>	 T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[13:48:42] <stashbot>	 T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667
[13:48:43] <wikibugs>	 (03CR) 10Bking: [C:03+2] admin_ng: allow opensearch deploy to use role/rolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[13:48:43] <stashbot>	 T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420
[13:50:44] <wikibugs>	 (03PS18) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)
[13:50:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza)
[13:51:11] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:51:29] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:51:37] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:51:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza)
[13:52:11] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]]
[13:52:15] <stashbot>	 T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631
[13:52:39] <wikibugs>	 (03PS4) 10Majavah: P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665)
[13:52:39] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784
[13:53:29] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:53:59] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:54:08] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6955/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede)
[13:54:34] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump vslow replicas of s4 in eqiad to 300 (T403966)', diff saved to https://phabricator.wikimedia.org/P83370 and previous config saved to /var/cache/conftool/dbconfig/20250916-135433-ladsgroup.json
[13:54:38] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:55:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott)
[13:55:42] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1160 (candidate master of s4) from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83371 and previous config saved to /var/cache/conftool/dbconfig/20250916-135542-ladsgroup.json
[13:57:40] <logmsgbot>	 !log tgr@deploy1003 tgr: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:57:45] <stashbot>	 T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631
[13:58:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188784 (owner: 10Majavah)
[13:58:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11185177 (10Jhancock.wm) @elukey 2049 was powered off. once i powered it on the nic came up.  I'll not set the root for 2053-8
[13:58:59] <jinxer-wm>	 FIRING: [19x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:00:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1199 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83372 and previous config saved to /var/cache/conftool/dbconfig/20250916-140020-ladsgroup.json
[14:00:26] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[14:01:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1247 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83373 and previous config saved to /var/cache/conftool/dbconfig/20250916-140147-ladsgroup.json
[14:01:59] <wikibugs>	 (03PS1) 10Majavah: kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788
[14:02:38] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Fix db1242 weight in s4  (T403966)', diff saved to https://phabricator.wikimedia.org/P83374 and previous config saved to /var/cache/conftool/dbconfig/20250916-140237-ladsgroup.json
[14:02:46] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "LGTM, but please keep in mind that files/certs already created on the deployment servers will not be cleaned up. You might want to do so m" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol)
[14:03:08] <wikibugs>	 (03CR) 10David Caro: [C:03+1] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah)
[14:03:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah)
[14:03:36] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] "Good call @jmeybohm@wikimedia.org thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol)
[14:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:03:49] <logmsgbot>	 !log tgr@deploy1003 tgr: Continuing with sync
[14:03:59] <jinxer-wm>	 FIRING: [23x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:06:11] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185242 (10Pcoombe) For fundraising banners we use the country from `mw.centralNotice.data.country` (which allows us to...
[14:06:39] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s4 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83375 and previous config saved to /var/cache/conftool/dbconfig/20250916-140638-ladsgroup.json
[14:06:44] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[14:09:13] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage
[14:09:15] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] (duration: 17m 04s)
[14:09:20] <stashbot>	 T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631
[14:09:39] <wikibugs>	 (03PS7) 10Scott French: hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772)
[14:10:03] <tgr>	 !log UTC afternoon deploys done
[14:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:36] <wikibugs>	 SRE, FR-donorrelations, Infrastructure-Foundations, Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11185266 (Aklapper) a:AMJohnson→DSeyfert_WMF
[14:11:20] <wikibugs>	 (CR) Clément Goubert: [C:+1] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: Scott French)
[14:11:39] <wikibugs>	 (PS2) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576)
[14:13:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:13:58] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage
[14:16:03] <wikibugs>	 (CR) Scott French: [C:+2] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: Scott French)
[14:18:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:18:59] <jinxer-wm>	 FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:19:13] <wikibugs>	 (CR) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[14:19:55] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185300 (phaultfinder)
[14:23:37] <wikibugs>	 SRE, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185307 (AKanji-WMF) @XenoRyet and I discussed getting this into our next Sprint as a stretch.
[14:27:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: Cyndywikime)
[14:29:09] <wikibugs>	 (CR) Michael Große: [C:+1] beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: Cyndywikime)
[14:29:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:30:33] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1017.eqiad.wmnet with OS bookworm
[14:30:44] <wikibugs>	 (PS3) Sergio Gimeno: beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: Cyndywikime)
[14:32:18] <wikibugs>	 (PS1) Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - https://gerrit.wikimedia.org/r/1188796
[14:33:30] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add inline patterns [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1188797
[14:34:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Add inline patterns [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1188797 (owner: Giuseppe Lavagetto)
[14:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:37:02] <wikibugs>	 (PS1) Arnaudb: Revert^4 "mailman: add a local disk cache" [puppet] - https://gerrit.wikimedia.org/r/1188798
[14:37:27] <wikibugs>	 (PS1) Sbisson: SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063)
[14:38:39] <wikibugs>	 (CR) CI reject: [V:-1] SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: Sbisson)
[14:38:47] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003"
[14:38:48] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003
[14:39:08] <wikibugs>	 (PS1) Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565)
[14:39:27] <wikibugs>	 (PS2) Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565)
[14:39:34] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003
[14:39:35] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003"
[14:39:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185397 (phaultfinder)
[14:40:13] <moritzm>	 !log installing libsndfile security updates
[14:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:20] <wikibugs>	 (CR) Samuel (WMF): [C:+1] hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[14:41:33] <wikibugs>	 (PS1) Andrew Bogott: Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249)
[14:42:30] <wikibugs>	 (PS2) Sbisson: SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063)
[14:42:38] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott)
[14:43:40] <wikibugs>	 (CR) CI reject: [V:-1] SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: Sbisson)
[14:46:16] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:toolforge::checker: Remove grid base profile [puppet] - https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: Majavah)
[14:46:29] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::checker: Remove grid base profile [puppet] - https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: Majavah)
[14:48:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:48:48] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11185433 (elukey) @Jhancock.wm Hi! When you have a moment could you please check if sretest2010 is in a weird state? I am not able to powercycle it..
[14:49:35] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Testing for T403882
[14:49:39] <stashbot>	 T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882
[14:49:47] <kostajh>	 jouncebot: nowandnext
[14:49:47] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1430)
[14:49:47] <jouncebot>	 In 0 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500)
[14:50:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:51:47] <wikibugs>	 (PS5) Majavah: P:toolforge: Remove support for grid bastions [puppet] - https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665)
[14:51:47] <wikibugs>	 (PS3) Majavah: P:toolforge: Cleanup buster support [puppet] - https://gerrit.wikimedia.org/r/1188784
[14:51:47] <wikibugs>	 (PS1) Majavah: P:toolforge: Delete cmd_checklist test suite [puppet] - https://gerrit.wikimedia.org/r/1188807
[14:52:39] <wikibugs>	 (PS3) Sbisson: SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063)
[14:52:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "spot-checked the most common entry paths, LGTM! feels-good-meme.png" [puppet] - https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[14:53:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:54:27] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:55:29] <wikibugs>	 (PS1) Muehlenhoff: Reset maps nodes for a fresh import [puppet] - https://gerrit.wikimedia.org/r/1188808
[14:57:21] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm
[14:57:25] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1052.eqiad.wmnet with OS bookworm
[14:57:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm
[14:58:49] <wikibugs>	 (CR) Elukey: [C:+1] Reset maps nodes for a fresh import [puppet] - https://gerrit.wikimedia.org/r/1188808 (owner: Muehlenhoff)
[14:59:13] <wikibugs>	 (PS2) Krinkle: Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510)
[14:59:17] <wikibugs>	 (PS3) Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group1) [puppet] - https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510)
[15:01:36] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Testing for T403882 (duration: 12m 01s)
[15:01:40] <stashbot>	 T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882
[15:07:38] <wikibugs>	 (PS1) Andrew Bogott: cloudcephosd: add python3-packaging [puppet] - https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249)
[15:08:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:09:00] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:24] <wikibugs>	 (PS2) Andrew Bogott: cloudcephosd: add python3-packaging [puppet] - https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249)
[15:09:28] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator: remove defunct ElasticSearch backend settings [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[15:09:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185524 (phaultfinder)
[15:10:14] <mutante>	 andre: no phorge deploy?
[15:10:33] <wikibugs>	 (PS1) Brouberol: kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188811
[15:10:41] <wikibugs>	 (CR) Clément Goubert: [C:+1] (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396) (owner: Hnowlan)
[15:10:49] <wikibugs>	 (CR) Brouberol: [C:+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188811 (owner: Brouberol)
[15:10:51] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188811 (owner: Brouberol)
[15:10:57] <andre>	 mutante: not this week, need to test the upstream pull more
[15:11:55] <mutante>	 andre: ok, ACK! we are doing the puppet patch that removes elasticsearch config
[15:12:06] <mutante>	 had it planned for the window.. remember
[15:12:21] <mutante>	 or that was the suggestion.. so getting it out now
[15:12:46] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudcephosd: add python3-packaging [puppet] - https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) (owner: Andrew Bogott)
[15:13:23] <andre>	 mutante, argh, true. Sorry, I forgot that one
[15:13:33] <wikibugs>	 (CR) Urbanecm: [C:+2] feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: Urbanecm)
[15:14:08] <mutante>	 andre: arrr.. I realized this file is under hieradata/role/eqiad .. that is kind of bad
[15:14:41] <andre>	 mutante, feel free not to deploy and rethink the problem :)
[15:15:23] <mutante>	 andre: well.. 2 options here.. either stuff is duplicated for each DC or it needs a second patch for codfw
[15:15:32] <mutante>	 ok
[15:17:46] <wikibugs>	 (CR) Dzahn: [C:+2] "this only affects eqiad but the same thing also exists in codfw - needs another patch" [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[15:18:40] <wikibugs>	 (PS1) Dzahn: phabricator: drop elasticsearch settings in codfw [puppet] - https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948)
[15:19:00] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:19:55] <wikibugs>	 (CR) Dzahn: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188815" [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[15:20:07] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator: drop elasticsearch settings in codfw [puppet] - https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: Dzahn)
[15:20:40] <wikibugs>	 (PS1) Brouberol: kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188816
[15:20:56] <wikibugs>	 (CR) Brouberol: [C:+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188816 (owner: Brouberol)
[15:21:02] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - https://gerrit.wikimedia.org/r/1188816 (owner: Brouberol)
[15:21:41] <jinxer-wm>	 FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:18] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:23:51] <wikibugs>	 (Merged) jenkins-bot: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: Urbanecm)
[15:24:09] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-codfw
[15:24:24] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw
[15:24:56] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-codfw
[15:25:10] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw
[15:25:23] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:25:23] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500)
[15:25:23] <jouncebot>	 In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600)
[15:26:20] <wikibugs>	 SRE-SLO, Charts, Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185583 (elukey) Thanks a lot for the patience folks, we have stopped onboarding new SLOs in Pyrra temporarily while we figure out T403729. We are comparing the results with another tool in T404171,...
[15:26:29] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]]
[15:26:32] <Dreamy_Jazz>	 Anyone mind if I deploy a security patch in this window?
[15:26:34] <stashbot>	 T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563
[15:26:43] <Dreamy_Jazz>	 Oh it seems that someone started scap as I said that :D
[15:26:44] <urbanecm>	 Dreamy_Jazz: i am currently deploying sth, but no concerns once i'm done
[15:26:49] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-by27-esams
[15:26:57] <urbanecm>	 sorry! CI just finished, so it started.
[15:27:01] <Dreamy_Jazz>	 Np
[15:27:02] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams
[15:27:08] <wikibugs>	 (CR) Dzahn: [C:+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[15:27:13] <wikibugs>	 (CR) Dzahn: [C:+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: Dzahn)
[15:28:01] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b13-drmrs
[15:28:15] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs
[15:28:33] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: Remove support for grid bastions [puppet] - https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[15:28:59] <jinxer-wm>	 FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:29:08] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b12-drmrs
[15:29:22] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b12-drmrs
[15:29:36] <mutante>	 andre: 3 different problems :)
[15:29:45] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-esams
[15:29:58] <andre>	 mutante, sorry, I did not see that can of worms coming and thought it's gonna be trivial :(
[15:30:08] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams
[15:30:19] <andre>	 if you want to turn that into a phab task feel free to I guess
[15:30:26] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: Cleanup buster support [puppet] - https://gerrit.wikimedia.org/r/1188784 (owner: Majavah)
[15:30:46] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-bw27-esams
[15:30:49] <mutante>	 andre: no blame! just sharing. I will leave comments
[15:30:59] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams
[15:31:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-drmrs
[15:31:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs
[15:31:30] <topranks>	 sorry for the spam with these cookbook runs for certs 
[15:32:28] <wikibugs>	 (PS1) Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - https://gerrit.wikimedia.org/r/1188820
[15:32:51] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-esams
[15:33:09] <wikibugs>	 (PS1) Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - https://gerrit.wikimedia.org/r/1188821
[15:33:16] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams
[15:33:18] <kostajh>	 jouncebot: nowandnext
[15:33:19] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500)
[15:33:19] <jouncebot>	 In 0 hour(s) and 26 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600)
[15:33:27] <kostajh>	 I'd like to deploy a MediaWiki patch 
[15:33:40] <wikibugs>	 (Abandoned) Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - https://gerrit.wikimedia.org/r/1188821 (owner: Dzahn)
[15:33:59] <jinxer-wm>	 FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:34:00] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:08] <Dreamy_Jazz>	 I'm in the queue to deploy a security patch
[15:34:09] <wikibugs>	 (PS1) Dzahn: Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - https://gerrit.wikimedia.org/r/1188822
[15:34:13] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188823 (https://phabricator.wikimedia.org/T404251)
[15:34:19] <Dreamy_Jazz>	 There is already a scap backport happening
[15:34:23] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-drmrs
[15:34:29] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188824 (https://phabricator.wikimedia.org/T404251)
[15:34:41] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs
[15:34:46] <wikibugs>	 (PS1) Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284)
[15:34:47] <wikibugs>	 (PS1) Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284)
[15:34:48] <wikibugs>	 (PS1) Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284)
[15:34:56] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc2001
[15:34:59] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-eqsin
[15:35:06] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc2001
[15:35:24] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:35:33] <kostajh>	 Dreamy_Jazz: ack, please ping me when you're done 
[15:35:41] <Dreamy_Jazz>	 Sure
[15:35:43] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin
[15:35:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:36:01] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr4-ulsfo
[15:36:02] <wikibugs>	 (CR) Dzahn: [C:+2] "a couple other things are needed here:" [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[15:36:10] <Dreamy_Jazz>	 urbanecm: Mind pinging me when you are done?
[15:36:13] <urbanecm>	 sure
[15:36:18] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo
[15:36:32] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - https://gerrit.wikimedia.org/r/1188820 (owner: Dzahn)
[15:36:45] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-ulsfo
[15:36:59] <wikibugs>	 SRE-SLO, Charts, Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185664 (CDanis) Luca, do you want an early test subject for the Sloth trial?
[15:37:07] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo
[15:37:13] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - https://gerrit.wikimedia.org/r/1188822 (owner: Dzahn)
[15:37:45] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f3-eqiad
[15:37:51] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-eqiad
[15:38:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f2-eqiad
[15:38:13] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-eqiad
[15:38:20] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e3-eqiad
[15:38:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-eqiad
[15:38:31] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e2-eqiad
[15:38:37] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-eqiad
[15:38:45] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e1-eqiad
[15:38:50] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-eqiad
[15:39:00] <jinxer-wm>	 FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:39:07] <wikibugs>	 (PS1) Majavah: backy2: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188825
[15:39:07] <wikibugs>	 (PS1) Majavah: ceph: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188826
[15:39:07] <wikibugs>	 (PS1) Majavah: P:toolforge::checker: Remove absent checks [puppet] - https://gerrit.wikimedia.org/r/1188827
[15:39:08] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188828
[15:39:08] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-f4-eqiad
[15:39:09] <wikibugs>	 (PS1) Majavah: P:toolforge::prometheus: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188829
[15:39:14] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-f4-eqiad
[15:39:20] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad
[15:39:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-e4-eqiad
[15:39:37] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw
[15:39:46] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw
[15:39:56] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-eqiad
[15:39:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah)
[15:40:08] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:40:10] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad
[15:40:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-c8-eqiad
[15:40:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 (owner: 10Majavah)
[15:40:36] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-c8-eqiad
[15:40:49] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad
[15:40:54] <wikibugs>	 (03PS2) 10Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284)
[15:40:54] <wikibugs>	 (03PS2) 10Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284)
[15:40:54] <wikibugs>	 (03PS2) 10Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284)
[15:40:55] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-eqiad
[15:41:05] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-d5-eqiad
[15:41:16] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-d5-eqiad
[15:41:25] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:41:35] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-eqiad
[15:41:38] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[15:41:43] <wikibugs>	 (03PS2) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826
[15:41:49] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad
[15:42:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T404480#11185695 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:42:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah)
[15:42:42] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[15:42:43] <wikibugs>	 (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah)
[15:44:01] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[15:44:08] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[15:44:14] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French)
[15:44:27] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage
[15:45:02] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[15:45:25] <logmsgbot>	 jhancock@cumin1002 provision (PID 1127341) is awaiting input
[15:45:34] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah)
[15:45:41] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: Delete cmd_checklist test suite [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah)
[15:47:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185731 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:48:32] <Dreamy_Jazz>	 urbanecm: I guess this is still going because it modified i18n?
[15:48:39] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott)
[15:48:55] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage
[15:49:34] <Dreamy_Jazz>	 I need to go, so kostajh you've moved forward in the queue
[15:49:51] <Dreamy_Jazz>	 I'll leave the security patch till later
[15:51:04] <wikibugs>	 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185773 (10elukey) >>! In T399613#11185664, @CDanis wrote: > Luca, do you want an early test subject for the Sloth trial?  Definitely, the first use case will be Citoid so we can make a comparison with...
[15:52:25] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[15:54:37] <urbanecm>	 Dreamy_Jazz: likely
[15:55:18] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:58:00] <wikibugs>	 (03CR) 10Btullis: "The values themselves look good, but you haven't enabled the installation for the dse-k8s-codfw cluster." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene)
[16:00:05] <jouncebot>	 jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600).
[16:00:05] <jouncebot>	 zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:31] <zabe>	 o/
[16:00:59] <jhathaway>	 o/
[16:01:17] <kostajh>	 there are some MW patches going out currently, as a heads up ^
[16:02:36] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm
[16:02:49] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Add Apache configuration for Wikimedia Thailand wiki [puppet] - 10https://gerrit.wikimedia.org/r/1187539 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe)
[16:04:09] <logmsgbot>	 !log urbanecm@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.45.0-wmf.17,1.45.0-wmf.18,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.210.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.210.0) (duration: 37m 39s)
[16:04:22] <urbanecm>	 what the hell?
[16:04:22] <jhathaway>	 zabe: patch merged
[16:04:33] <zabe>	 jhathaway: thx :)
[16:05:03] <urbanecm>	 jhathaway: might the merge interfere with the scap (that was running from before)? or is that unrelated?
[16:06:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1052.eqiad.wmnet with OS bookworm
[16:06:40] <jhathaway>	 urbanecm: not sure
[16:06:56] <cdanis>	 urbanecm: I think it would have to be that the patch got merged *and* puppet ran on the deployment host, for there to be any possible effect
[16:07:07] <urbanecm>	 fair
[16:07:14] <urbanecm>	 this seems to be the key part of the log https://www.irccloud.com/pastebin/6OvUSnro/
[16:07:42] <cdanis>	 on deploy1003 it last finished at 15:54, well before the +2
[16:07:49] <urbanecm>	 yeah
[16:08:15] <cdanis>	 soooo
[16:08:40] <cdanis>	 it did renew some certificates, for mw-experimental / mw-experimental-deploy
[16:09:03] <cdanis>	 I don't know if scap uses those?
[16:09:21] <urbanecm>	 it seems pushing to docker-registry failed
[16:09:32] <cdanis>	 I don't think that would have used those certs
[16:09:46] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:09:55] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm
[16:11:04] <urbanecm>	 i can also try again and hope the push'll work on second try :-/
[16:12:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185881 (10RobH) I overthought this, we should just move them with an SFP-T to the new port and worry about reimage and migration to full 10G later.
[16:13:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185897 (10RobH)
[16:13:42] <urbanecm>	 cdanis: any objections to that? or do you want to look into where this occurred?
[16:14:22] <cdanis>	 urandom: no objections
[16:14:40] <cdanis>	 nothing jumping out at me in https://grafana.wikimedia.org/d/StcefURWz/docker-registry?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000006&var-instance=$__all either
[16:15:06] <urbanecm>	 ack, restarting
[16:15:39] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]]
[16:15:43] <stashbot>	 T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563