[00:03:11] (03PS2) 10RLazarus: deployment_server: Add a script for mass-deploying helmfile services [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) [00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:04:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [00:04:46] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:08:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 [00:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:08:50] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply [00:08:59] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:09:14] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:11:50] !log 
rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:13:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:13:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:13:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [00:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:15] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:19:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:19:50] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:20:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:20:40] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:20:52] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [00:21:00] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [00:21:11] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply [00:21:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [00:22:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:22:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:22:19] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:22:27] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:22:36] !log 
rzl@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:22:52] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:23:08] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [00:23:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:23:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:23:57] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:24:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:24:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:24:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:24:51] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:28:05] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:28:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [00:28:31] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:29:20] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:29:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:29:45] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [00:29:58] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [00:42:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold 
for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:52:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) [01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:23:32] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad 
in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183385 (10phaultfinder) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0200) [02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:22:08] (03PS1) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:22:38] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:23:31] (03PS2) 10Krinkle: varnish: add 
support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:23:57] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183413 (10phaultfinder) [02:26:01] (03PS3) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:28:54] (03PS4) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:06] (03PS5) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:04] (03PS6) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:23] (03PS7) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:32:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11183419 (10Papaul) @cmooney we have the spare PEM on site. I need to get on a call with Juniper to troubleshooting this. Do you think Thursd... 
[02:34:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:34:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:35:17] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183420 (10Papaul) [02:36:29] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183421 (10Papaul) 05Open→03Resolved a:03Papaul The BIO reader is installed now and working. so closing this task [02:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:15] (03PS1) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:39:28] (03PS2) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:43:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:43:58] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:50:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183429 (10phaultfinder) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0300) [03:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183455 (10phaultfinder) [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0400) [04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:04:18] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.16 (duration: 04m 08s) [04:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183464 (10phaultfinder) [04:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:53] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] 
FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:08] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:27:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:32:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:33:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:37:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:38:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:39] (03PS1) 10Huei Tan: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) [05:47:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [05:47:20] (03Restored) 10Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:47:29] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:48:54] Hi, I have 2 patches for the later backport window. Kartik is not available; can someone help with the deployment? [05:54:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183555 (10phaultfinder) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600). [06:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, but this changes an existing sudo rule, so needs SRE IF meeting approval" [puppet] - 10https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) (owner: 10CDanis) [06:52:44] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:59:56] I can deploy these backports. [07:00:00] o/ [07:00:03] thanks [07:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0700). nyaa~ [07:00:04] hueitan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:14] (03Merged) 10jenkins-bot: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:40] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] [07:03:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:07:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:09:20] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:09:40] !log awight@deploy1003 awight, hueitan: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:09:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:11:39] hueitan: Please check on mwdebug [07:11:48] awight: thanks for the deployments! :] [07:12:33] awight checked, see it live now on mwdebug [07:12:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:45] hashar: My pleasure—Spider Pig has not let me down [07:12:51] hueitan: ty [07:12:58] !log awight@deploy1003 awight, hueitan: Continuing with sync [07:13:05] awight: yeah it is quite rad! Maybe one day we will have an equivalent to run Quibble from a web interface! 
:b [07:13:13] the bacula alert will get fixed soon [07:15:03] (03PS2) 10Slyngshede: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 [07:15:09] (03CR) 10Slyngshede: [C:03+2] Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:15] (03Merged) 10jenkins-bot: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:16] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] (duration: 14m 35s) [07:18:20] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:18:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:18:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Arelion (2001:2035:0:a9a::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:18:45] Finished. On to the second patch... 
[07:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:19:21] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:21:04] (03Merged) 10jenkins-bot: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:21:21] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] [07:21:25] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:59] (03CR) 10Arnaudb: [C:03+2] mailman: add a local disk cache [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [07:25:47] hueitan: is this one testable? [07:25:55] let me check [07:26:09] maybe I kafkacat or... [07:26:43] hueitan: sorry, it's not quite ready to test yet [07:27:01] I was confusingly asking ahead of time [07:27:40] !log awight@deploy1003 hueitan, awight: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:27:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:58] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:02] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:28:11] awight i see it live now [07:28:44] !log awight@deploy1003 hueitan, awight: Continuing with sync [07:28:47] hueitan: ack [07:28:55] Thank you, all good [07:30:02] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:30:53] (03PS1) 10Arnaudb: Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 [07:31:34] (03CR) 10Jelto: [C:03+1] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb) [07:32:33] (03CR) 10Arnaudb: [C:03+2] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb) [07:32:43] FIRING: [4x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:48] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:02] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:33:36] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:34:31] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] (duration: 13m 10s) [07:35:02] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:35:20] 💯 [07:35:38] !log UTC morning deployments finished [07:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:48] (03PS1) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 [07:35:49] hueitan: Thanks for the help :-) [07:36:06] awight thank you [07:37:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: 
Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:39:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:42:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:44:06] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:46:51] (03CR) 10Fabfur: [C:03+2] profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [07:47:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:43] FIRING: [4x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:38] (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) [07:52:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:48] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:53] FIRING: [15x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:53:36] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:57:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:57:48] 
RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:02:43] RESOLVED: [8x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:04] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - Certificate kafka-test1008.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:22:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - Certificate kafka-test1006.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:22:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - Certificate 
kafka-test1010.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:23:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - Certificate kafka-test1007.eqiad.wmnet valid until 2025-09-23 08:23:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:24:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2006 is CRITICAL: SSL CRITICAL - Certificate kafka-main2006.codfw.wmnet valid until 2025-09-23 08:24:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:24:48] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [08:25:04] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - Certificate kafka-test1009.eqiad.wmnet valid until 2025-09-23 08:25:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:26:59] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [08:27:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2007 is CRITICAL: SSL CRITICAL - Certificate kafka-main2007.codfw.wmnet valid until 2025-09-23 08:27:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:28:04] PROBLEM - Kafka broker TLS certificate validity on kafka-main2009 is CRITICAL: SSL CRITICAL - Certificate kafka-main2009.codfw.wmnet valid until 2025-09-23 08:28:00 +0000 (expires in 6 days) 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:31:01] (03CR) 10Elukey: [C:03+1] Apply replica role to maps1012-1014 [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:31:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2008 is CRITICAL: SSL CRITICAL - Certificate kafka-main2008.codfw.wmnet valid until 2025-09-23 08:31:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:32:03] (03CR) 10Elukey: [C:03+2] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [08:34:53] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681 (10hueitan) 03NEW [08:35:54] (03PS1) 10Gergő Tisza: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) [08:36:38] (03PS1) 10Gergő Tisza: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) [08:37:06] (03PS1) 10Gergő Tisza: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) [08:37:24] PROBLEM - Kafka broker TLS certificate validity on kafka-main2010 is CRITICAL: SSL CRITICAL - Certificate kafka-main2010.codfw.wmnet valid until 2025-09-23 08:37:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:38:10] (03PS1) 10Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 
(https://phabricator.wikimedia.org/T399200) [08:38:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [08:38:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:38:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:39:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:41:33] (03Merged) 10jenkins-bot: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [08:42:02] (03PS1) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [08:45:12] (03PS2) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) [08:45:17] (03CR) 10Gergő Tisza: Enable 
JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:45:38] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721 [08:45:47] (03CR) 10CI reject: [V:04-1] Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:46:32] (03PS2) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [08:46:43] (03CR) 10CI reject: [V:04-1] tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:47:01] (03PS3) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) [08:47:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:53:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:57:37] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721 (owner: 10Elukey) [08:58:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:58:49] (03PS3) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) [08:59:25] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#11183855 (10cmooney) >>! In T380050#10654652, @BCornwall wrote: > Re: https://gerrit.wikimedia.org/r/c/operations/dns/+/1091711/comments/5e6962e8_b88980ce - Do the IPs need to be deleted from netbox? Y... 
[09:00:00] (03CR) 10Effie Mouzeli: [C:04-1] "@kosta, please provide where we define the version, so to add it in the comments and move forward, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:00:06] (03PS1) 10Elukey: Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728 [09:00:18] (03CR) 10Effie Mouzeli: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:01:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:03:01] (03PS2) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) [09:03:39] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:04:43] (03PS4) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) [09:05:08] (03CR) 10Effie Mouzeli: "variable is $wgHCaptchaApiUrl" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:05:24] (03PS3) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) [09:05:40] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728 (owner: 10Elukey) [09:06:14] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:06:28] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 
(https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:06:34] (03PS1) 10Alexandros Kosiaris: deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 [09:08:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:08:18] (03PS2) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 [09:08:39] RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:08:52] (03CR) 10Arnaudb: [C:03+2] Revert^2 "mailman: add a local disk cache" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:09:33] (03CR) 10Jelto: [C:03+1] Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:11:23] (03CR) 10Elukey: [C:03+1] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris) [09:12:03] (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (owner: 10Effie Mouzeli) [09:12:13] (03PS1) 10Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732 [09:12:57] (03PS3) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) [09:12:57] (03CR) 10Arnaudb: [C:03+2] Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732 (owner: 10Arnaudb) [09:13:27] (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [09:23:50] jouncebot: nowandnext [09:23:50] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [09:23:50] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000) [09:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:24:31] (03CR) 10Hnowlan: [C:03+1] P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:24:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183943 (10phaultfinder) [09:25:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:25:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:27:05] !log uploaded spicerack_11.7.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [09:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] (03CR) 10Giuseppe Lavagetto: [C:03+1] "we already do this in scap IIRC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris) [09:30:40] (03CR) 10Slyngshede: [C:03+1] "LGTM. Tested in local environment." [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:31:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz) [09:31:30] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [09:31:59] (03Merged) 10jenkins-bot: Document that test2wiki has suggested investigations DB tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz) [09:32:15] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] [09:32:19] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:32:54] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:34:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 
96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:36:27] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie [09:38:19] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:puppetserver::volatile avoid loading Spur data on certain host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [09:38:20] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:38:25] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:38:42] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [09:39:50] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2026-08-23 08:34:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:44:04] (03PS2) 10Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) [09:44:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188352|Document that 
test2wiki has suggested investigations DB tables (T404594)]] (duration: 11m 54s) [09:44:14] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:45:37] (03CR) 10Fabfur: "Thanks, another test is always helpful!" [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:46:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [09:46:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83353 and previous config saved to /var/cache/conftool/dbconfig/20250916-094609-ladsgroup.json [09:46:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [09:46:25] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2026-08-23 08:21:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:47:58] !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188379 (T401383) [09:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:02] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383 [09:48:39] (03PS2) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) [09:48:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [09:48:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83354 and previous 
config saved to /var/cache/conftool/dbconfig/20250916-094846-ladsgroup.json [09:49:48] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [09:50:30] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688 (10SLyngshede-WMF) 03NEW [09:50:48] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184085 (10SLyngshede-WMF) p:05Triage→03High [09:52:08] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [09:52:15] (03CR) 10Fabfur: [C:03+2] haproxy: use utf8ps converter on received headers [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:52:25] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184097 (10SLyngshede-WMF) [09:52:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:52:59] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:53:57] RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2026-08-23 08:23:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:54:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [09:56:28] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#11184115 (Jelto) My proposal to move forward is to sync the files from object storage to a local folder on the GitLab host. Ideal...
[09:57:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[09:58:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[09:59:58] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: unset more headers [puppet] - https://gerrit.wikimedia.org/r/1188367 (https://phabricator.wikimedia.org/T403416) (owner: Effie Mouzeli)
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000)
[10:00:04] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:01:01] jynus: volans: heads up, I'm going to start deploying a change to multi-dc.lua on cp nodes https://gerrit.wikimedia.org/r/c/1182815/
[10:01:19] cc fabfur ^
[10:01:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83355 and previous config saved to /var/cache/conftool/dbconfig/20250916-100152-ladsgroup.json
[10:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:01:58] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:02:16] (PS1) Brouberol: deployment_server: allow different namespaces to be deployed within a same cluster group [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068)
[10:02:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:03:09] (PS2) Brouberol: deployment_server: allow different namespaces to be deployed within a group [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068)
[10:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:04:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage
[10:04:17] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: add temporary redirect [puppet] - https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: Effie Mouzeli)
[10:06:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[10:06:40] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11184198 (Joe) >>! In T400119#11115049, @TheDJ wrote: > Yeah getting the swagger spec via `curl https://api.wikimedia.org/core/v1/wikipedia/en/search/pag...
[10:08:30] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm
[10:09:55] (PS1) Effie Mouzeli: P:hcaptcha: typo (oops) [puppet] - https://gerrit.wikimedia.org/r/1188737
[10:10:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:10:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:10:42] !log tests looks good, enable puppet on A:cp (T401383)
[10:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:46] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383
[10:11:17] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: typo (oops) [puppet] - https://gerrit.wikimedia.org/r/1188737 (owner: Effie Mouzeli)
[10:11:44] ^^ that is me
[10:11:51] it is ok
[10:12:25] FIRING:
SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:14:00] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:14:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83356 and previous config saved to /var/cache/conftool/dbconfig/20250916-101420-ladsgroup.json
[10:14:25] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:15:01] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: Anzx)
[10:15:04] (PS3) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251)
[10:15:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:17:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83357 and previous config saved to /var/cache/conftool/dbconfig/20250916-101700-ladsgroup.json
[10:17:04] fabfur: ah, you're deploying things on cp nodes?
[10:17:16] Should I wait a little for https://gerrit.wikimedia.org/r/c/1182815/ ?
[10:17:25] RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:46] (CR) Muehlenhoff: "That seems fine, only the the ipblocks/abuse hierarchy is sourced by the ferm requestctl support and those rules are mostly made in reacti" [puppet] - https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm)
[10:18:00] claime: I reenabled puppet on A:cp, no problem on my side to proceed with other changes
[10:18:17] but thanks for noticing!
[10:18:44] fabfur: ack, but since I'll have to re-disable puppet on A:cp, I may still need to wait
[10:18:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:18:53] otherwise your change may not deploy in isolation
[10:19:32] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie
[10:23:49] it's ok for me
[10:24:02] so, the ongoing recovery of restarted is claime and the recovery of urldownloader was effie, right?
[10:24:11] No, I've touched nothing yet
[10:24:13] (PS1) Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859)
[10:24:14] ah
[10:25:28] (CR) Alexandros Kosiaris: [C:+2] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - https://gerrit.wikimedia.org/r/1188731 (owner: Alexandros Kosiaris)
[10:25:33] anything urldownloader is effie rn though :)
[10:27:38] !log sudo cumin 'A:cp' "disable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'"
[10:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:43] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412
[10:28:11] riposoqualita@gmail.com
[10:28:16] ops, bad paste
[10:28:25] 👀
[10:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83358 and previous config saved to /var/cache/conftool/dbconfig/20250916-102928-ladsgroup.json
[10:29:37] (CR) Clément Goubert: [C:+2] multi-dc: Dynamic rewrite to -ro destinations [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert)
[10:29:58] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184292 (elukey) Resolved→Open >>! In T394357#11162710, @MatthewVernon wrote: > Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has...
[10:30:27] SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11184299 (elukey) The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.
[10:31:01] !log Enabling puppet for testing on cp6011 and cp2041 - T402412 - T400131
[10:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:09] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[10:31:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm
[10:32:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83359 and previous config saved to /var/cache/conftool/dbconfig/20250916-103208-ladsgroup.json
[10:33:23] jynus: yes we are alright
[10:33:39] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11184312 (jcrespo) If that unblocks you, I am ok with that- sadly because other priorities keep entering data persistence with un...
[10:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:37:03] elukey@cumin1003 reimage (PID 2807097) is awaiting input
[10:42:29] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm
[10:43:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:44:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83360 and previous config saved to /var/cache/conftool/dbconfig/20250916-104436-ladsgroup.json
[10:45:33] (CR) Ladsgroup: [C:+1] es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:46:09] My tests look good but something looks wrong-ish with the rest-gateway
[10:46:26] it's serving 30 5xx per second since a bit past 0955
[10:47:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:47:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83361 and previous config saved to /var/cache/conftool/dbconfig/20250916-104715-ladsgroup.json
[10:47:20] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts
- https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:49:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:50:02] Ugh proton again
[10:50:24] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11184382 (cmooney) Hey @papaul yeah Thursday will be fine thanks.
[10:52:16] I'm moving forward despite this, I'll diagnose it in parallel, it's unrelated to the change
[10:52:31] !log sudo cumin 'A:cp' "enable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'"
[10:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:36] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412
[10:53:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:56:48] (CR) Federico Ceratto: [C:+2] es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:57:21] elukey@cumin1003 interactive (PID 2810037) is awaiting input
[10:57:43] SRE, SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697 (mszwarc) NEW
[10:59:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83363 and previous config saved to /var/cache/conftool/dbconfig/20250916-105944-ladsgroup.json
[10:59:49] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[11:00:36] SRE, SRE-Access-Requests: Requesting access to
deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11184443 (OKryva-WMF) As Marcin's Engineering Manager, approve.
[11:18:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002"
[11:18:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002"
[11:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:21:26] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:04] (PS4) Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433)
[11:25:05] (CR) CI reject: [V:-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[11:27:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:31:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:31:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:33:22] (CR) Btullis: [C:+1] "Looks good to me, but I would like to make sure that others are also able to review, for visibility." [puppet] - https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: Brouberol)
[11:39:23] (PS1) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:39:31] (PS1) Clément Goubert: Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751
[11:40:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org
[11:41:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.385 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:47] (PS2) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:42:09] (CR) Hnowlan: [C:+1] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751 (owner: Clément Goubert)
[11:43:08] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:44:19] (CR) CI reject: [V:-1] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:45:48] (PS3) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750
(https://phabricator.wikimedia.org/T404251)
[11:47:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org
[11:48:08] (CR) CI reject: [V:-1] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[11:48:59] RESOLVED: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:33] (PS4) Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251)
[11:54:51] (CR) Clément Goubert: [C:+2] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/1188751 (owner: Clément Goubert)
[11:54:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11184626 (phaultfinder)
[11:58:51] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184633 (elukey) There is something definitely off, I just tested the following and everything hangs: ` -> reset /system1/pwrmgtsvc1 /system1/pwrmgtsvc1 ` I am trying to set...
[11:59:40] (PS1) Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396)
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200)
[12:00:47] (Abandoned) Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[12:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:05:46] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11184653 (MoritzMuehlenhoff) There was a small issue with install3004, it lacked the global ipv6 address, which caused failing ipv6 probes to Squid. The rele...
[12:08:40] (PS1) Esanders: Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687)
[12:08:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83364 and previous config saved to /var/cache/conftool/dbconfig/20250916-120842-ladsgroup.json
[12:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:08:48] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:09:51] (CR) CI reject: [V:-1] Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: Esanders)
[12:15:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1194 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83365 and previous config saved to /var/cache/conftool/dbconfig/20250916-121545-ladsgroup.json
[12:15:50] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:18:07] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184704 (cmooney) >>! In T404609#11181649, @RobH wrote: > @cmooney: What do you think is the best way to go about migrating these connections on upcoming C...
[12:18:20] !log depooling cp2041 - T402412
[12:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:25] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412
[12:19:58] (PS2) Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396)
[12:22:03] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184711 (cmooney) @RobH @Jclark-ctr there is also another way we could try to approach this so may as well mention it now before we start planning. Rack-b...
[12:37:06] (PS1) Huei Tan: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420)
[12:37:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: Huei Tan)
[12:40:44] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184793 (Jclark-ctr) @cmooney I’m flexible to try either way. Maybe a mix could work? We could start with roles that aren’t single points of failure and ar...
[12:46:38] (PS1) Urbanecm: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563)
[12:46:46] jouncebot: nowandnext
[12:46:46] For the next 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200)
[12:46:46] In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300)
[12:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:49:14] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:49:38] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:50:58] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h).
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[12:53:22] (PS2) Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859)
[12:53:22] (PS1) Federico Ceratto: instances.yaml: add es2049 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859)
[12:56:11] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: Adjust regex match [puppet] - https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[12:57:52] (PS1) Slyngshede: P:puppetserver::volatile xcheese [puppet] - https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[12:58:37] o/ i need someone help with my patch deployment.
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300).
[13:00:05] joelyrookewmde, tgr, anzx, and hueitan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] o/ i need someone help with my patch deployment.
[13:00:19] (CR) CI reject: [V:-1] P:puppetserver::volatile xcheese [puppet] - https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[13:00:50] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1188374 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[13:01:00] I can deploy
[13:01:27] o/
[13:01:29] tgr tq
[13:01:33] o/
[13:01:42] I’m in a meeting, thanks tgr for deploying :)
[13:02:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1253 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83366 and previous config saved to /var/cache/conftool/dbconfig/20250916-130201-ladsgroup.json
[13:02:06] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:03:46] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: Anzx)
[13:03:47] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: Joely Rooke WMDE)
[13:05:01] (Merged) jenkins-bot: Lift IP cap for workshop at University of Pretoria on 29-30 September [mediawiki-config] - https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: Anzx)
[13:05:04] (Merged) jenkins-bot: Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: Joely Rooke WMDE)
[13:05:21] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels
(T395674)]]
[13:05:26] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218
[13:05:27] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674
[13:06:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1191 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83367 and previous config saved to /var/cache/conftool/dbconfig/20250916-130618-ladsgroup.json
[13:09:35] SRE, FR-donorrelations, Infrastructure-Foundations, Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11184909 (AMJohnson) Open→Resolved a:AMJohnson @DSeyfert_WMF was able to fix this for us. Thank you, Dustin! Going ahead and closing out this...
[13:09:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1202 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83368 and previous config saved to /var/cache/conftool/dbconfig/20250916-130935-ladsgroup.json
[13:09:41] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[13:10:58] !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:04] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218
[13:11:04] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674
[13:11:25] tgr: nothing to test on throttle
[13:11:52] joelyrookewmde: I assume you don't need to test either?
[13:11:53] @tgr sorry I missed the start of this deployment. Thanks for approving it ! [13:12:02] no all good for me [13:12:10] !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Continuing with sync [13:13:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83369 and previous config saved to /var/cache/conftool/dbconfig/20250916-131345-ladsgroup.json [13:14:25] hueitan: do you feel confident about your patch? if it's low-risk, I'll bundle it with the other backports [13:14:39] yes, confident [13:15:21] (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [13:17:35] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] (duration: 12m 14s) [13:17:41] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218 [13:17:42] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674 [13:17:44] (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:19:45] (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events (031 comment) [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:20:15] tgr: Just reviewed it. 
It's low risk and will reduce event validation errors back to the baseline rate [13:20:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [13:20:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:22:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [13:22:25] (03PS1) 10Clément Goubert: Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774 [13:22:37] (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774 (owner: 10Clément Goubert) [13:23:25] (03CR) 10Filippo Giunchedi: [C:03+2] profile: clean up root-authorized-key.erb transition [puppet] - 10https://gerrit.wikimedia.org/r/1188374 
(https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [13:23:29] jouncebot: nowandnext [13:23:29] For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300) [13:23:29] In 0 hour(s) and 36 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1400) [13:23:47] (03PS2) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:25:20] (03PS1) 10Andrew Bogott: codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776 [13:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:52] (03Merged) 10jenkins-bot: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [13:25:56] (03Merged) 10jenkins-bot: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:02] (03Merged) 10jenkins-bot: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 
10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:04] (03Merged) 10jenkins-bot: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:23] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:27:05] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776 (owner: 10Andrew Bogott) [13:29:22] (03Merged) 10jenkins-bot: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:29:43] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events [13:29:43] (T404420)]] [13:29:51] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:29:52] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:29:52] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:29:53] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [13:30:15] (03PS4) 10Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [13:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:51] (03PS2) 10Majavah: hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) [13:31:51] (03PS1) 10Majavah: hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510) [13:32:51] (03CR) 10Majavah: [C:03+2] hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah) [13:33:02] (03CR) 10Majavah: [C:03+2] hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah) [13:33:21] (03PS3) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:35:15] !log repooling cp2041, test inconclusive, rolled back - T402412 [13:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:21] T402412: Route test2wiki rest.php APIs through rest-gateway - 
https://phabricator.wikimedia.org/T402412 [13:35:40] !log tgr@deploy1003 hueitan, tgr: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events (T404420)]] [13:35:40] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:35:48] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:35:49] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:35:49] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:35:50] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [13:36:51] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:38:35] (y) [13:40:54] (03PS4) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:41:58] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1017.eqiad.wmnet with OS bookworm [13:43:23] !log tgr@deploy1003 hueitan, tgr: Continuing with sync [13:43:28] (03PS1) 10Andrew Bogott: Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249) [13:44:38] (03PS3) 10Majavah: P:toolforge: remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [13:44:39] (03PS1) 10Majavah: 
P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) [13:44:40] (03PS1) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [13:45:03] (03PS1) 10Andrew Bogott: Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249) [13:45:12] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:48:27] (03PS4) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [13:48:33] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events [13:48:33] (T404420)]] (duration: 18m 50s) [13:48:41] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:48:42] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:48:42] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:48:43] (03CR) 10Bking: [C:03+2] admin_ng: allow opensearch deploy to use role/rolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:48:43] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - 
https://phabricator.wikimedia.org/T404420 [13:50:44] (03PS18) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [13:50:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:51:11] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:51:29] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:51:37] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:51:58] (03Merged) 10jenkins-bot: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:52:11] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] [13:52:15] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:52:39] (03PS4) 10Majavah: P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [13:52:39] (03PS2) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [13:53:29] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:53:59] FIRING: [17x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:54:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6955/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:54:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump vslow replicas of s4 in eqiad to 300 (T403966)', diff saved to https://phabricator.wikimedia.org/P83370 and previous config saved to /var/cache/conftool/dbconfig/20250916-135433-ladsgroup.json [13:54:38] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:55:25] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:55:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1160 (candidate master of s4) from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83371 and previous config saved to /var/cache/conftool/dbconfig/20250916-135542-ladsgroup.json [13:57:40] !log tgr@deploy1003 tgr: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:57:45] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:58:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188784 (owner: 10Majavah) [13:58:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11185177 (10Jhancock.wm) @elukey 2049 was powered off. 
once i powered it on the nic came up. I'll not set the root for 2053-8 [13:58:59] FIRING: [19x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:00:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1199 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83372 and previous config saved to /var/cache/conftool/dbconfig/20250916-140020-ladsgroup.json [14:00:26] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:01:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1247 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83373 and previous config saved to /var/cache/conftool/dbconfig/20250916-140147-ladsgroup.json [14:01:59] (03PS1) 10Majavah: kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 [14:02:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Fix db1242 weight in s4 (T403966)', diff saved to https://phabricator.wikimedia.org/P83374 and previous config saved to /var/cache/conftool/dbconfig/20250916-140237-ladsgroup.json [14:02:46] (03CR) 10JMeybohm: [C:03+1] "LGTM, but please keep in mind that files/certs already created on the deployment servers will not be cleaned up. You might want to do so m" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [14:03:08] (03CR) 10David Caro: [C:03+1] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah) [14:03:13] (03CR) 10Majavah: [C:03+2] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah) [14:03:36] (03CR) 10Brouberol: [C:03+2] "Good call @jmeybohm@wikimedia.org thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [14:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:03:49] !log tgr@deploy1003 tgr: Continuing with sync [14:03:59] FIRING: [23x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:06:11] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185242 (10Pcoombe) For fundraising banners we use the country from `mw.centralNotice.data.country` (which allows us to... 
[14:06:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s4 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83375 and previous config saved to /var/cache/conftool/dbconfig/20250916-140638-ladsgroup.json [14:06:44] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:09:13] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage [14:09:15] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] (duration: 17m 04s) [14:09:20] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [14:09:39] (03PS7) 10Scott French: hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) [14:10:03] !log UTC afternoon deploys done [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:36] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11185266 (10Aklapper) a:05AMJohnson→03DSeyfert_WMF [14:11:20] (03CR) 10Clément Goubert: [C:03+1] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: 10Scott French) [14:11:39] (03PS2) 10Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) [14:13:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[14:13:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage [14:16:03] (03CR) 10Scott French: [C:03+2] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: 10Scott French) [14:18:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:18:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:19:13] (03CR) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [14:19:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185300 (10phaultfinder) [14:23:37] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185307 (10AKanji-WMF) @XenoRyet and I discussed getting this into our next Sprint as a stretch. 
[14:27:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:29:09] (03CR) 10Michael Große: [C:03+1] beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1017.eqiad.wmnet with OS bookworm [14:30:44] (03PS3) 10Sergio Gimeno: beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:32:18] (03PS1) 10Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188796 [14:33:30] (03PS1) 10Giuseppe Lavagetto: Add inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188797 [14:34:11] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188797 (owner: 10Giuseppe Lavagetto) [14:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:02] (03PS1) 10Arnaudb: 
Revert^4 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188798 [14:37:27] (03PS1) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:38:39] (03CR) 10CI reject: [V:04-1] SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [14:38:47] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003" [14:38:48] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003 [14:39:08] (03PS1) 10Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) [14:39:27] (03PS2) 10Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) [14:39:34] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003 [14:39:35] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003" [14:39:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185397 (10phaultfinder) [14:40:13] !log installing libsndfile security updates [14:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:20] (03CR) 10Samuel (WMF): [C:03+1] 
hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [14:41:33] (03PS1) 10Andrew Bogott: Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249) [14:42:30] (03PS2) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:42:38] (03CR) 10Andrew Bogott: [C:03+2] Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [14:43:40] (03CR) 10CI reject: [V:04-1] SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [14:46:16] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [14:46:29] (03CR) 10Majavah: [C:03+2] P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [14:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:48:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11185433 (10elukey) @Jhancock.wm Hi! 
When you have a moment could you please check if sretest2010 is in a weird state? I am not able to powercycle it.. [14:49:35] !log dancy@deploy1003 Started scap sync-world: Testing for T403882 [14:49:39] T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882 [14:49:47] jouncebot: nowandnext [14:49:47] For the next 0 hour(s) and 10 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1430) [14:49:47] In 0 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [14:50:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:51:47] (03PS5) 10Majavah: P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [14:51:47] (03PS3) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [14:51:47] (03PS1) 10Majavah: P:toolforge: Delete cmd_checklist test suite [puppet] - 10https://gerrit.wikimedia.org/r/1188807 [14:52:39] (03PS3) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:52:50] (03CR) 10Filippo Giunchedi: [C:03+1] "spot-checked the most common entry paths, LGTM! 
feels-good-meme.png" [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [14:53:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:54:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:29] (03PS1) 10Muehlenhoff: Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 [14:57:21] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm [14:57:25] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1052.eqiad.wmnet with OS bookworm [14:57:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm [14:58:49] (03CR) 10Elukey: [C:03+1] Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 (owner: 10Muehlenhoff) [14:59:13] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510) [14:59:17] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group1) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) [15:01:36] !log dancy@deploy1003 Finished scap sync-world: Testing for T403882 (duration: 12m 01s) [15:01:40] T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882 [15:07:38] (03PS1) 10Andrew Bogott: cloudcephosd: 
add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) [15:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:24] (03PS2) 10Andrew Bogott: cloudcephosd: add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) [15:09:28] (03CR) 10Dzahn: [C:03+2] phabricator: remove defunct ElasticSearch backend settings [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185524 (10phaultfinder) [15:10:14] andre: no phorge deploy? 
[15:10:33] (03PS1) 10Brouberol: kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 [15:10:41] (03CR) 10Clément Goubert: [C:03+1] (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [15:10:49] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:10:51] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:10:57] mutante: not this week, need to test the upstream pull more [15:11:55] andre: ok, ACK! we are doing the puppet patch that removes elasticsearch config [15:12:06] had it planned for the window.. remember [15:12:21] or that was the suggestion.. so getting it out now [15:12:46] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd: add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [15:13:23] mutante, argh, true. Sorry, I forgot that one [15:13:33] (03CR) 10Urbanecm: [C:03+2] feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: 10Urbanecm) [15:14:08] andre: arrr.. I realized this file is under hieradata/role/eqiad .. that is kind of bad [15:14:41] mutante, feel free not to deploy and rethink the problem :) [15:15:23] andre: well.. 2 options here.. 
either stuff is duplicated for each DC or it needs a second patch for codfw [15:15:32] ok [15:17:46] (03CR) 10Dzahn: [C:03+2] "this only affects eqiad but the same thing also exists in codfw - needs another patch" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:18:40] (03PS1) 10Dzahn: phabricator: drop elasticsearch settings in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) [15:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:19:55] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188815" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:20:07] (03CR) 10Dzahn: [C:03+2] phabricator: drop elasticsearch settings in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:20:40] (03PS1) 10Brouberol: kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 [15:20:56] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:21:02] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:21:41] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:18] !log 
fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:23:51] (03Merged) 10jenkins-bot: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: 10Urbanecm) [15:24:09] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-codfw [15:24:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw [15:24:56] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-codfw [15:25:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw [15:25:23] jouncebot: nowandnext [15:25:23] For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [15:25:23] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600) [15:26:20] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185583 (10elukey) Thanks a lot for the patience folks, we have stopped onboarding new SLOs in Pyrra temporarily while we figure out T403729. We are comparing the results with another tool in T404171,... [15:26:29] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] [15:26:32] Anyone mind if I deploy a security patch in this window? 
[15:26:34] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [15:26:43] Oh it seems that someone started scap as I said that :D [15:26:44] Dreamy_Jazz: i am currently deploying sth, but no concerns once i'm done [15:26:49] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-by27-esams [15:26:57] sorry! CI just finished, so it started. [15:27:01] Np [15:27:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams [15:27:08] (03CR) 10Dzahn: [C:03+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:27:13] (03CR) 10Dzahn: [C:03+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:28:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b13-drmrs [15:28:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs [15:28:33] (03CR) 10Majavah: [C:03+2] P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [15:28:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:29:08] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b12-drmrs [15:29:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network 
device asw1-b12-drmrs [15:29:36] andre: 3 different problems :) [15:29:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-esams [15:29:58] mutante, sorry, I did not see that can of worms coming and thought it's gonna be trivial :( [15:30:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams [15:30:19] if you want to turn that into a phab task feel free to I guess [15:30:26] (03CR) 10Majavah: [C:03+2] P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 (owner: 10Majavah) [15:30:46] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-bw27-esams [15:30:49] andre: no blame! just sharing. I will leave comments [15:30:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams [15:31:07] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-drmrs [15:31:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs [15:31:30] sorry for the spam with these cookbook runs for certs [15:32:28] (03PS1) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188820 [15:32:51] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-esams [15:33:09] (03PS1) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188821 [15:33:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams [15:33:18] jouncebot: nowandnext [15:33:19] For the next 0 hour(s) and 26 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [15:33:19] In 0 hour(s) and 26 minute(s): Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600) [15:33:27] I'd like to deploy a MediaWiki patch [15:33:40] (03Abandoned) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188821 (owner: 10Dzahn) [15:33:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:08] I'm in the queue to deploy a security patch [15:34:09] (03PS1) 10Dzahn: Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1188822 [15:34:13] (03PS1) 10Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188823 (https://phabricator.wikimedia.org/T404251) [15:34:19] There is already a scap backport happening [15:34:23] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-drmrs [15:34:29] (03PS1) 10Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188824 (https://phabricator.wikimedia.org/T404251) [15:34:41] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs [15:34:46] (03PS1) 10Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) [15:34:47] (03PS1) 10Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) [15:34:48] (03PS1) 10Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) [15:34:56] !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc2001 [15:34:59] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-eqsin [15:35:06] !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc2001 [15:35:24] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:35:33] Dreamy_Jazz: ack, please ping me when you're done [15:35:41] Sure [15:35:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin [15:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:36:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr4-ulsfo [15:36:02] (03CR) 10Dzahn: [C:03+2] "a couple other things are needed here:" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:36:10] urbanecm: Mind pinging me when you are done? 
[15:36:13] sure [15:36:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo [15:36:32] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188820 (owner: 10Dzahn) [15:36:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-ulsfo [15:36:59] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185664 (10CDanis) Luca, do you want an early test subject for the Sloth trial? [15:37:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo [15:37:13] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1188822 (owner: 10Dzahn) [15:37:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f3-eqiad [15:37:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-eqiad [15:38:07] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f2-eqiad [15:38:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-eqiad [15:38:20] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e3-eqiad [15:38:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-eqiad [15:38:31] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e2-eqiad [15:38:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-eqiad [15:38:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e1-eqiad [15:38:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-eqiad [15:39:00] FIRING: [27x] 
CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:07] (03PS1) 10Majavah: backy2: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188825 [15:39:07] (03PS1) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [15:39:07] (03PS1) 10Majavah: P:toolforge::checker: Remove absent checks [puppet] - 10https://gerrit.wikimedia.org/r/1188827 [15:39:08] (03PS1) 10Majavah: P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 [15:39:08] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-f4-eqiad [15:39:09] (03PS1) 10Majavah: P:toolforge::prometheus: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188829 [15:39:14] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-f4-eqiad [15:39:20] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad [15:39:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-e4-eqiad [15:39:37] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [15:39:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [15:39:56] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-eqiad [15:39:58] (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:40:08] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:40:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad [15:40:25] !log cmooney@cumin1003 
START - Cookbook sre.network.tls for network device cloudsw1-c8-eqiad [15:40:29] (03CR) 10CI reject: [V:04-1] P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 (owner: 10Majavah) [15:40:36] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-c8-eqiad [15:40:49] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [15:40:54] (03PS2) 10Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) [15:40:54] (03PS2) 10Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) [15:40:54] (03PS2) 10Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) [15:40:55] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-eqiad [15:41:05] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-d5-eqiad [15:41:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-d5-eqiad [15:41:25] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:35] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-eqiad [15:41:38] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [15:41:43] (03PS2) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [15:41:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad [15:42:28] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - 
issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T404480#11185695 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:42:30] (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:42:42] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [15:42:43] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:44:01] (03CR) 10RLazarus: [C:03+1] shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:08] (03CR) 10RLazarus: [C:03+1] shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:14] (03CR) 10RLazarus: [C:03+1] shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:27] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [15:45:02] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [15:45:25] jhancock@cumin1002 provision (PID 1127341) is awaiting input [15:45:34] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah) [15:45:41] (03CR) 10Majavah: [C:03+2] P:toolforge: Delete cmd_checklist test suite [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah) [15:47:09] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185731 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm 
[15:48:32] urbanecm: I guess this is still going because it modified i18n? [15:48:39] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott) [15:48:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [15:49:34] I need to go, so kostajh you've moved forward in the queue [15:49:51] I'll leave the security patch till later [15:51:04] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185773 (10elukey) >>! In T399613#11185664, @CDanis wrote: > Luca, do you want an early test subject for the Sloth trial? Definitely, the first use case will be Citoid so we can make a comparison with... [15:52:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [15:54:37] Dreamy_Jazz: likely [15:55:18] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:58:00] (03CR) 10Btullis: "The values themselves look good, but you haven't enabled the installation for the dse-k8s-codfw cluster." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [16:00:05] jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600). [16:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:31] o/ [16:00:59] o/ [16:01:17] there are some MW patches going out currently, as a heads up ^ [16:02:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm [16:02:49] (03CR) 10JHathaway: [C:03+2] Add Apache configuration for Wikimedia Thailand wiki [puppet] - 10https://gerrit.wikimedia.org/r/1187539 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [16:04:09] !log urbanecm@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.45.0-wmf.17,1.45.0-wmf.18,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/ [16:04:09] mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.210.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi [16:04:09] awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.210.0) (duration: 37m 39s) [16:04:22] what the hell? [16:04:22] zabe: patch merged [16:04:33] jhathaway: thx :) [16:05:03] jhathaway: might the merge interfere with the scap (that was running from before)? or is that unrelated? 
[16:06:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1052.eqiad.wmnet with OS bookworm [16:06:40] urbanecm: not sure [16:06:56] urbanecm: I think it would have to be that the patch got merged *and* puppet ran on the deployment host, for there to be any possible effect [16:07:07] fair [16:07:14] this seems to be the key part of the log https://www.irccloud.com/pastebin/6OvUSnro/ [16:07:42] on deploy1003 it last finished at 15:54, well before the +2 [16:07:49] yeah [16:08:15] soooo [16:08:40] it did renew some certificates, for mw-experimental / mw-experimental-deploy [16:09:03] I don't know if scap uses those? [16:09:21] it seems pushing to docker-registry failed [16:09:32] I don't think that would have used those certs [16:09:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:09:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm [16:11:04] i can also try again and hope the push'll work on second try :-/ [16:12:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185881 (10RobH) I overthought this, we should just move them with an SFP-T to the new port and worry about reimage and migration to full 10G later. [16:13:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185897 (10RobH) [16:13:42] cdanis: any objections to that? or do you want to look into where this occurred? 
[16:14:22] urandom: no objections [16:14:40] nothing jumping out at me in https://grafana.wikimedia.org/d/StcefURWz/docker-registry?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000006&var-instance=$__all either [16:15:06] ack, restarting [16:15:39] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] [16:15:43] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563