[00:00:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b2 [00:00:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b3 [00:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163907 [00:08:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163907 (owner: TrainBranchBot) [00:12:01] (PS2) Andrea Denisse: centrallog: Disable temporary rsyslog debug config file. [puppet] - https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) [00:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:14:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b3 [00:14:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b4 [00:24:16] (Abandoned) Andrea Denisse: centrallog: Disable temporary rsyslog debug config file.
[puppet] - https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [00:28:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b4 [00:28:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b5 [00:30:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163907 (owner: TrainBranchBot) [00:39:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:40:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:43:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b5 [00:43:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b6 [00:55:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:56:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:58:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b6 [00:58:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b7 [00:58:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:58:57] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [00:59:39] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [01:00:04] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [01:01:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:52] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.b7 [01:12:55] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b8 [01:13:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [01:13:51] 
RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:16:17] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [01:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:21:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:23:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:25:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:27:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs 
of wikipedia-commons-local-thumb.b8 [01:27:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b9 [01:31:33] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [01:40:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:41:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b9 [01:41:43] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ba [01:46:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:54:12] (CR) Abijeet Patro: "This patch caused an issue that needed to be fixed in a follow up: 1163419: Desktop editor: Instrumentation provider should not be used el" [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1163845 (https://phabricator.wikimedia.org/T395493) (owner: Sbisson) [01:55:08] (Abandoned) Abijeet Patro: Mobile editor: restore VE toolbar position [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1163814 (https://phabricator.wikimedia.org/T397840) (owner: Abijeet Patro) [01:57:15] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ba
[01:57:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bb [02:11:02] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.bb [02:11:04] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bc [02:11:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:12:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:26:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.bc [02:26:29] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bd [02:27:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:40:16] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.bd [02:40:19] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.be [02:41:59] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [02:45:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:51:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:54:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.be [02:54:13] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.bf [03:08:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.bf [03:08:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c0 [03:22:42] !log mvernon@cumin1002 END (PASS) - Cookbook 
sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c0 [03:22:45] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c1 [03:28:23] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [03:31:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:33:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:36:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c1 [03:36:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c2 [03:42:30] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [03:42:43] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [03:42:52] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [03:43:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [03:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:46:20] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3548 MB (3% inode=98%): /tmp 3548 MB (3% inode=98%): /var/tmp 3548 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [03:51:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c2 [03:51:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c3 [03:55:35] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [04:02:10] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [04:02:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c3 [04:06:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c4 [04:07:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:12:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:17:54] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [04:20:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c4 [04:20:14] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c5 [04:22:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:23:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:26:20] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3485 MB (3% inode=98%): /tmp 3485 MB (3% inode=98%): /var/tmp 3485 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [04:35:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:35:27] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c5 [04:35:29] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c6 [04:40:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:40:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:42:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:43:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:48:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:49:50] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c6 [04:49:52] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c7 [04:50:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:52:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:01:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:01:48] FIRING: [7x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:03] FIRING: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:08] FIRING: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:04:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:04:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c7 [05:04:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c8 [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq 
in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:21:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c8 [05:21:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.c9 [05:35:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.c9 [05:35:42] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ca [05:39:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:44:58] (CR) Arnaudb: [C:+1] "good idea, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: AOkoth) [05:49:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ca [05:49:26] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.cb [05:49:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:52:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:57:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:58:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0600) [06:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0600). [06:03:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.cb [06:03:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.cc [06:03:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:07:43] (PS1) Muehlenhoff: Failover IDP [dns] - https://gerrit.wikimedia.org/r/1163919 [06:10:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:15:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad -
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:17:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:18:43] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.cc [06:18:46] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.cd [06:22:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:23:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:25:13] (CR) Muehlenhoff: [C:+2] Failover IDP [dns] - https://gerrit.wikimedia.org/r/1163919 (owner: Muehlenhoff) [06:25:18] (PS2) Muehlenhoff: Failover IDP [dns] - https://gerrit.wikimedia.org/r/1163919 [06:26:01] (PS3) Anzx: IP cap lift for Wikipedia Edit-A-Thon - Fernando Garcia [mediawiki-config] - https://gerrit.wikimedia.org/r/1163920 (https://phabricator.wikimedia.org/T397720) [06:26:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163920
(https://phabricator.wikimedia.org/T397720) (owner: Anzx) [06:28:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:30:30] (CR) Muehlenhoff: "Looks good in general, two nits inline" [puppet] - https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth) [06:30:50] (CR) Muehlenhoff: [V:+2 C:+2] Failover IDP [dns] - https://gerrit.wikimedia.org/r/1163919 (owner: Muehlenhoff) [06:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:31:01] !log jmm@dns1004 START - running authdns-update [06:32:01] !log jmm@dns1004 END - running authdns-update [06:32:01] (Abandoned) Anzx: IP cap lift for Wikipedia Edit-A-Thon - Fernando Garcia [mediawiki-config] - https://gerrit.wikimedia.org/r/1163920 (https://phabricator.wikimedia.org/T397720) (owner: Anzx) [06:32:08] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.cd [06:32:11] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ce [06:34:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:36:33]
(PS1) Muehlenhoff: Record extended contract date for dani [puppet] - https://gerrit.wikimedia.org/r/1164029 [06:38:13] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 17.11 [puppet] - https://gerrit.wikimedia.org/r/1164031 (https://phabricator.wikimedia.org/T397899) [06:39:09] !log jmm@dns1004 START - running authdns-update [06:39:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:16] (CR) Arnaudb: [C:+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.11 [puppet] - https://gerrit.wikimedia.org/r/1164031 (https://phabricator.wikimedia.org/T397899) (owner: Jelto) [06:40:05] !log jmm@dns1004 END - running authdns-update [06:41:59] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [06:44:15] (CR) Muehlenhoff: [C:+2] Record extended contract date for dani [puppet] - https://gerrit.wikimedia.org/r/1164029 (owner: Muehlenhoff) [06:45:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:45:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:04] (PS1) Fabfur: data: remove users from analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1164037 (https://phabricator.wikimedia.org/T397850) [06:46:27] !log mvernon@cumin1002 END (PASS) - Cookbook
sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ce [06:46:30] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.cf [06:46:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:49:37] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1178.eqiad.wmnet [06:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:51:41] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1178.eqiad.wmnet [06:52:05] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1180.eqiad.wmnet [06:53:58] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1180.eqiad.wmnet [06:54:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:55:05] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1181.eqiad.wmnet [06:57:02] !log stevemunene@cumin1002 END (FAIL) - Cookbook 
sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1181.eqiad.wmnet [06:57:24] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1182.eqiad.wmnet [06:58:18] (03CR) 10Majavah: [C:04-1] keystone policy: allow object_storage role to create/delete ec2 creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [06:59:52] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1182.eqiad.wmnet [07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:00:38] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.cf [07:00:41] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d0 [07:01:00] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1183.eqiad.wmnet [07:03:02] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1183.eqiad.wmnet [07:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:04:54] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1184.eqiad.wmnet [07:05:14] (03PS3) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 
10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [07:06:43] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1184.eqiad.wmnet [07:06:48] (03CR) 10Muehlenhoff: Unvendor Bootstrap (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:06:55] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1185.eqiad.wmnet [07:07:28] (03CR) 10Ayounsi: [C:03+1] Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 (owner: 10JHathaway) [07:08:45] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1185.eqiad.wmnet [07:10:08] (03PS5) 10Muehlenhoff: Depend on libjs-bootstrap4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) [07:10:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:11:05] (03PS13) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [07:11:44] (03CR) 10Stevemunene: [C:03+2] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1163777 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene) [07:11:53] (03PS14) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [07:15:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking 
container DBs of wikipedia-commons-local-thumb.d0 [07:15:06] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d1 [07:15:51] (03CR) 10Volans: [C:03+2] Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 (owner: 10JHathaway) [07:16:43] FIRING: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:16:48] FIRING: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:16:53] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:20:21] (03PS32) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [07:20:26] (03CR) 10Slyngshede: [C:03+1] "LGTM, the three users not in the patch are already marked as absent." [puppet] - 10https://gerrit.wikimedia.org/r/1164037 (https://phabricator.wikimedia.org/T397850) (owner: 10Fabfur) [07:21:35] (03PS1) 10Arnaudb: gerrit: add readonly.jar [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1164044 (https://phabricator.wikimedia.org/T387833) [07:21:35] (03CR) 10Arnaudb: "1159395 is now merged, we'll need to deploy the readonly plugin on our instances to be able to enable it. 
It is disabled par default and w" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1164044 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:22:19] (03PS33) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [07:25:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:25:52] (03Merged) 10jenkins-bot: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 (owner: 10JHathaway) [07:26:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:26:48] FIRING: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:28:36] (03PS15) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) 
[07:29:29] (03CR) 10Vgutierrez: "thanks sukhe for taking care of debugging the nil element "bug" <3" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [07:29:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d1 [07:29:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d2 [07:30:34] jouncebot: nowandnext [07:30:34] For the next 0 hour(s) and 29 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0700) [07:30:34] In 0 hour(s) and 29 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0800) [07:30:44] (03PS2) 10Abban Dunne: Add WMDE Fundraising banner event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 [07:31:43] RESOLVED: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:31:43] RESOLVED: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:31:49] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [07:31:53] RESOLVED: [2x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:17] (03CR) 10Vgutierrez: [C:04-1] "as mentioned on the other CR, let's go with `check_ssl_cdn`" [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [07:34:01] (03CR) 10Vgutierrez: [C:03+1] hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh) [07:34:30] (03PS3) 10Abban Dunne: Add WMDE Fundraising banner event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 [07:38:49] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy: properly indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842 (owner: 10Ssingh) [07:41:56] (03CR) 10Abban Dunne: Add WMDE Fundraising banner event stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 (owner: 10Abban Dunne) [07:43:08] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1009.eqiad.wmnet with reason: Maintenance and reboot [07:43:39] (03CR) 10Elukey: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [07:44:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d2 [07:44:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d3 [07:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:55:44] (03PS1) 10Volans: CHANGELOG: add changelogs for release v11.2.0 [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1164121 [07:56:30] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v11.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164121 (owner: 10Volans) [07:57:50] (03CR) 10Muehlenhoff: [C:04-1] "See comments on task" [puppet] - 10https://gerrit.wikimedia.org/r/1164037 (https://phabricator.wikimedia.org/T397850) (owner: 10Fabfur) [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0800) [08:00:19] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.d3 [08:00:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d4 [08:01:57] wmf.7 has successfully reached group 1 yesterday. There is no train blocker on https://phabricator.wikimedia.org/T392177 [08:02:48] the rest of the wikis will be pushed by jeena later tonight (18:00 UTC) [08:02:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:03:25] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1009.eqiad.wmnet: Renew puppet certificate - jynus@cumin1002 [08:03:26] (03PS4) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [08:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:04:16] (03PS5) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [08:04:24] joelyrookewmde59: I can deploy your patch this morning if you want ( https://gerrit.wikimedia.org/r/c/1163704/ ) [08:04:52] hi hi [08:05:03] I think we're dependent on the train [08:05:06] (03PS2) 10Hashar: Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [08:05:20] that needs to go after all wikis have been upgraded to wmf.7? 
[08:05:37] yes unfortunately [08:06:08] so we can't do that during the backport window this afternoon [08:06:11] we could cherry pick the crucial part but this can wait until next week [08:06:28] yeah I'll cancel this aft and reschedule [08:06:49] there is another window later tonight at 20:00 UTC / 22:00 UTC+2 [08:06:52] but that is rather late [08:07:12] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v11.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164121 (owner: 10Volans) [08:07:16] we can poke the train blocker of this week and have people deploy the patch for you tonight [08:07:25] hahah yeah I'm in UTC+2 and don't fancy staying later in case of errors [08:07:36] unsurprisingly :]]] [08:07:40] else I am fine deploying it tomorrow morning if needed [08:08:06] that seems rather low risk [08:08:10] I thought friday deployments were not possible except in case of emergency [08:08:15] ? [08:08:20] yeah that is the question :] [08:09:19] ok I will ask my team right now and ping you in 5 with a decision? 
[08:09:24] the rule of thumb is nobody wants to be called on a Friday evening, or worse on a Saturday morning [08:09:51] so if that feature is really needed we can do it on Friday, but would need some people from SRE to +1 as well [08:10:00] else we do it on Monday morning [08:10:01] (03CR) 10Muehlenhoff: [C:03+2] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:10:25] in both case I am happy to assist with the deployment and following monitoring of log/error rates etc [08:10:41] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:11:01] (03CR) 10Muehlenhoff: Depend on libjs-bootstrap4 (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:11:12] (03CR) 10Muehlenhoff: [C:03+2] Depend on libjs-bootstrap4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:13:13] Ok so if tonight is an option that would be great. I don't think I can join personally though, so should I write you/ someone else the check we need to do to confirm the feature works? 
[08:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:13:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:13:58] joelyrookewmde59: yes having a procedure to check would be IDEAL and I guess that is the primary reason a deployer wants the dev/patch author to be present [08:14:16] but if there is a summary of what the feature does and how to verify it works as intended, a deployer can do it for you [08:14:27] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d4 [08:14:30] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d5 [08:16:14] I would: write the check steps on your task T388685, schedule it for tonight, and add a message on the train blocker of this week explaining the Gerrit change has to be deployed after wmf.7 got rolled out and linking to the instruction [08:16:15] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [08:16:17] the blocker of the week is https://phabricator.wikimedia.org/T392177 [08:16:28] excellent ! 
[08:16:35] and I think I can attend tonight :] [08:16:50] you're stronger than me :D [08:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:17:24] I'll do these notes now :) [08:19:19] joelyrookewmde59: awesome, Danke Schon! [08:20:11] \o/ [08:21:11] (03PS1) 10Slyngshede: Add new Netbox records repo [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) [08:21:15] (03PS1) 10Volans: Upstream release v11.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1164125 [08:21:24] (03CR) 10Volans: [C:03+2] Upstream release v11.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1164125 (owner: 10Volans) [08:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:27:30] (03CR) 10Vgutierrez: "https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/6d4c99c29b19155369889ca16ffd34bd42797c2c%5E%21/#F0 should cover the" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [08:28:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: Maintenance [08:29:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d5 [08:29:20] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of 
wikipedia-commons-local-thumb.d6 [08:30:45] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.11 [puppet] - 10https://gerrit.wikimedia.org/r/1164031 (https://phabricator.wikimedia.org/T397899) (owner: 10Jelto) [08:31:40] (03Merged) 10jenkins-bot: Upstream release v11.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1164125 (owner: 10Volans) [08:32:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:36:27] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10949480 (10Vgutierrez) 05Open→03Resolved acme-chief issued the certificate including pywikipedia.org successfully this time: ` vgutierrez@acmechief2002:~$ sudo -i openssl x509... 
[08:37:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:38:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:43:15] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d6 [08:43:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d7 [08:43:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:43:26] !log uploaded spicerack_11.2.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [08:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:23] !log installed spicerack v11.2.0 on cumin2002 [08:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:48:47] (03CR) 
10Muehlenhoff: [C:03+1] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:54:07] (03PS1) 10Filippo Giunchedi: o11y: adjust ThanosSidecarDropQueries threshold [alerts] - 10https://gerrit.wikimedia.org/r/1164129 (https://phabricator.wikimedia.org/T394318) [08:55:06] (03CR) 10Muehlenhoff: [C:03+2] Remove external cloud sync from Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [08:57:36] (03CR) 10Elukey: [C:03+1] kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:58:01] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d7 [08:58:04] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d8 [08:58:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:58:38] (03CR) 10Elukey: [C:03+1] kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:59:53] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.11 [09:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:51] 
FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:07:24] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:07:30] (03PS1) 10Effie Mouzeli: mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 [09:07:57] (03CR) 10CI reject: [V:04-1] mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:08:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:34] (03PS2) 10Effie Mouzeli: mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 [09:09:43] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:09:59] (03CR) 10CI reject: [V:04-1] mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:10:49] (03CR) 10Muehlenhoff: mediawiki_experimental: minor fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:11:02] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10949566 (10Volans) @Jhancock.wm thanks, I've run the provision cookbook on `cp2044` but is giving me authentication credential error. 
[09:11:12] (03CR) 10Volans: [C:03+2] kubernetes: add a new kubernetes section (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:11:20] (03CR) 10Volans: [C:03+2] kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:11:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d8 [09:11:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.d9 [09:12:10] (03CR) 10CI reject: [V:04-1] kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:12:33] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10949569 (10Volans) But I've tested the scp_dump that was failing and it's fixed. So I think the provision should work. Feel free to try it (from cumin2002 that has the... 
[09:12:38] (03PS3) 10Volans: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) [09:12:45] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.11 [09:13:03] (03PS4) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) [09:13:16] (03PS3) 10Effie Mouzeli: mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 [09:13:41] (03CR) 10CI reject: [V:04-1] mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:14:45] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [09:14:54] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.11 [09:16:14] (03PS4) 10Effie Mouzeli: mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 [09:16:42] (03CR) 10CI reject: [V:04-1] mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:16:43] (03CR) 10Effie Mouzeli: mediawiki_experimental: minor fixes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:18:49] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2161 gradually with 4 steps - Pooling in [09:18:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:20:39] !log jiji@cumin1003 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [09:20:55] (03PS5) 10Effie Mouzeli: mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 [09:21:55] (03CR) 10Volans: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:21:58] (03CR) 10Volans: [C:03+2] kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:22:13] (03CR) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:22:16] (03CR) 10Volans: [C:03+2] kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:22:55] (03Merged) 10jenkins-bot: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:22:56] (03Merged) 10jenkins-bot: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [09:23:48] (03PS4) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) [09:23:48] (03PS3) 10Filippo Giunchedi: icinga: fix mypy call-overload error in icinga.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) [09:23:48] (03CR) 10Filippo Giunchedi: "Thank you, your solution SGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:23:50] (03CR) 10Effie Mouzeli: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:24:29] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [09:24:42] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [09:24:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:25:11] (03PS1) 10Kosta Harlan: signup.js: Fix name used for signup_validate_password [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164135 (https://phabricator.wikimedia.org/T397890) [09:25:24] jouncebot: nowandnext [09:25:24] For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T0800) [09:25:25] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1000) [09:25:35] I will deploy a patch to wmf.7, unless there are any objections [09:25:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.d9 [09:25:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.da [09:26:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164135 (https://phabricator.wikimedia.org/T397890) (owner: 10Kosta Harlan) [09:27:35] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.11 [09:30:24] !log 
jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [09:30:35] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [09:30:49] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:32:25] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki_experimental: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/1164133 (owner: 10Effie Mouzeli) [09:33:37] (03PS34) 10Slyngshede: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [09:35:09] (03PS4) 10Filippo Giunchedi: icinga: fix mypy call-overload error in icinga.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) [09:35:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [09:35:14] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] icinga: fix mypy call-overload error in icinga.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:35:49] (03CR) 10Volans: "Nit inline. It's nice to see that with the recent upgrades for bookworm it doesn't fail anymore." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:36:26] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: adjust ThanosSidecarDropQueries threshold [alerts] - 10https://gerrit.wikimedia.org/r/1164129 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [09:36:35] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [09:38:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.da [09:39:01] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.db [09:39:23] (03Merged) 10jenkins-bot: signup.js: Fix name used for signup_validate_password [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164135 (https://phabricator.wikimedia.org/T397890) (owner: 10Kosta Harlan) [09:40:01] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [09:40:04] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1164135|signup.js: Fix name used for signup_validate_password (T397890)]] [09:40:10] T397890: wmf.7 - Invalid stat name mediawiki_signup_validatepassword - https://phabricator.wikimedia.org/T397890 [09:40:58] (03PS5) 10Filippo Giunchedi: tox: add python 3.12 and 3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) [09:41:04] (03CR) 10Filippo Giunchedi: tox: add python 3.12 and 3.13 (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:41:22] (03PS1) 10Hnowlan: mobileapps: set requests == limits for other containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164137 
(https://phabricator.wikimedia.org/T397750) [09:41:50] (03PS2) 10Stevemunene: hdfs: set an-worker1176 to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163692 (https://phabricator.wikimedia.org/T390176) [09:42:11] (03CR) 10Brouberol: [C:03+1] hdfs: set an-worker1176 to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163692 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene) [09:42:19] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1164135|signup.js: Fix name used for signup_validate_password (T397890)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:42:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [09:43:58] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:44:15] !log kharlan@deploy1003 kharlan: Continuing with sync [09:44:52] (03Abandoned) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [09:45:28] (03Abandoned) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [09:46:12] (03CR) 10Filippo Giunchedi: [C:03+2] tox: add python 3.12 and 3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:47:09] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [09:47:12] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [09:49:21] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164135|signup.js: Fix name used for signup_validate_password (T397890)]] (duration: 09m 17s) [09:49:27] T397890: wmf.7 - Invalid stat name 
mediawiki_signup_validatepassword - https://phabricator.wikimedia.org/T397890 [09:49:29] (03PS1) 10Clément Goubert: O:kubernetes::deployment_server: mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) [09:49:43] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [09:49:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:50:25] (03CR) 10Clément Goubert: [C:03+1] mobileapps: set requests == limits for other containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164137 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [09:52:57] (03PS2) 10Clément Goubert: O:kubernetes::deployment_server: mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) [09:53:44] (03CR) 10Ladsgroup: [C:03+1] O:kubernetes::deployment_server: mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [09:53:52] (03CR) 10Hnowlan: [C:03+2] mobileapps: set requests == limits for other containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164137 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [09:54:05] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [09:54:07] (03CR) 10Hnowlan: [C:03+1] O:kubernetes::deployment_server: mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) (owner: 
10Clément Goubert) [09:54:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.db [09:54:20] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.dc [09:55:39] (03Merged) 10jenkins-bot: mobileapps: set requests == limits for other containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164137 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [09:56:03] Done deploying [09:56:45] (03Merged) 10jenkins-bot: tox: add python 3.12 and 3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:58:04] (03CR) 10Ayounsi: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [09:58:53] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [09:58:53] (03CR) 10Clément Goubert: [C:03+2] O:kubernetes::deployment_server: mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/1164139 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [09:59:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1000) [10:04:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2161 gradually with 4 steps - Pooling in [10:05:06] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:05:36] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:06:17] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:07:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:09:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.dc [10:09:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.dd [10:09:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:10:55] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:11:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [10:14:40] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:19:44] !log dropping job table on all wikis (T397367) 
[10:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:51] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [10:23:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.dd [10:23:13] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.de [10:23:35] (03Abandoned) 10Fabfur: data: remove users from analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1164037 (https://phabricator.wikimedia.org/T397850) (owner: 10Fabfur) [10:23:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T397163 [10:23:48] T397163: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T397163 [10:24:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T397163', diff saved to https://phabricator.wikimedia.org/P78702 and previous config saved to /var/cache/conftool/dbconfig/20250626-102415-fceratto.json [10:24:28] !log dropping l10n_cache table in group0 wikis [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T397163', diff saved to https://phabricator.wikimedia.org/P78703 and previous config saved to /var/cache/conftool/dbconfig/20250626-102533-fceratto.json [10:26:20] (03PS1) 10Zabe: beta: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164141 (https://phabricator.wikimedia.org/T397912) [10:26:42] (03PS2) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [10:28:38] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 
10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [10:28:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:29:50] (03PS3) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [10:30:03] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [10:30:52] (03PS4) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [10:31:10] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1160096 (https://phabricator.wikimedia.org/T397163) [10:31:43] (03CR) 10FedericoCeratto: [C:03+1] "Automatically generated PR: no other approval needed." [puppet] - 10https://gerrit.wikimedia.org/r/1160096 (https://phabricator.wikimedia.org/T397163) (owner: 10Gerrit maintenance bot) [10:32:42] (03CR) 10Ladsgroup: [C:03+1] "Did we run the migration script?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164141 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [10:32:52] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [10:33:28] (03CR) 10Zabe: "no not yet, but will do before merging this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164141 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [10:38:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.de [10:38:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.df [10:40:20] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1160096 (https://phabricator.wikimedia.org/T397163) (owner: 10Gerrit maintenance bot) [10:40:36] (03PS1) 10GergesShamon: [arwikiversity] fix wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164145 (https://phabricator.wikimedia.org/T397845) [10:41:28] !log Starting s4 codfw failover from db2240 to db2179 - T397163 [10:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:33] T397163: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T397163 [10:42:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164145 (https://phabricator.wikimedia.org/T397845) (owner: 10GergesShamon) [10:42:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T397163', diff saved to https://phabricator.wikimedia.org/P78704 and previous config saved to /var/cache/conftool/dbconfig/20250626-104226-fceratto.json [10:44:35] (03PS1) 10Ayounsi: 
reimage: don't stop if FQDN is used instead of hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1164147 [10:44:40] (03PS2) 10Stevemunene: hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) [10:48:28] (03CR) 10Stevemunene: hdfs: Assign the right role to new hadoop workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene) [10:48:44] (03PS1) 10Effie Mouzeli: debug.json: remove mwdebugX hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164148 (https://phabricator.wikimedia.org/T397498) [10:51:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance [10:51:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T395241)', diff saved to https://phabricator.wikimedia.org/P78705 and previous config saved to /var/cache/conftool/dbconfig/20250626-105112-fceratto.json [10:51:32] (03PS1) 10Btullis: Dumps_v1: Disable the sync job that publishes from dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1164150 (https://phabricator.wikimedia.org/T397848) [10:51:55] (03PS2) 10Acamicamacaraca: Disable translations in sh-latn and sh-cyrl (wgTranslateDisabledTargetLanguages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) [10:52:40] jouncebot: next [10:52:40] In 1 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1200) [10:53:38] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.df [10:53:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e0 [10:53:46] (03CR) 10JMeybohm: [V:03+1] 
pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:54:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164148 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [10:54:25] (03CR) 10JMeybohm: [C:03+2] sre.wipe-cluster: Ask user to confirm target k8s version [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [10:54:27] (03CR) 10JMeybohm: [C:03+2] k8s.wipe-cluster: Run puppet in batches of 50 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [10:59:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T395241)', diff saved to https://phabricator.wikimedia.org/P78706 and previous config saved to /var/cache/conftool/dbconfig/20250626-105940-fceratto.json [10:59:55] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: make wikikube-worker2100 a normal worker [puppet] - 10https://gerrit.wikimedia.org/r/1159519 (owner: 10Effie Mouzeli) [11:00:07] (03PS2) 10Effie Mouzeli: hieradata: make wikikube-worker2100 a normal worker [puppet] - 10https://gerrit.wikimedia.org/r/1159519 [11:00:20] (03PS5) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 [11:00:20] (03PS1) 10Ayounsi: reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 [11:01:30] (03Merged) 10jenkins-bot: k8s.wipe-cluster: Run puppet in batches of 50 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [11:01:31] (03Merged) 
10jenkins-bot: sre.wipe-cluster: Ask user to confirm target k8s version [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [11:01:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [11:01:46] (03PS1) 10Volans: Upstream release v0.6.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164152 [11:02:24] (03PS1) 10Muehlenhoff: Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) [11:02:26] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: make wikikube-worker2100 a normal worker [puppet] - 10https://gerrit.wikimedia.org/r/1159519 (owner: 10Effie Mouzeli) [11:02:37] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2007.codfw.wmnet with OS bookworm [11:03:16] (03PS1) 10Jgiannelos: changeprop: Debug if-unmodified-since impact on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164154 (https://phabricator.wikimedia.org/T397750) [11:03:54] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [11:04:02] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [11:04:18] (03CR) 10CI reject: [V:04-1] Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [11:04:48] (03PS35) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:06:50] (03PS36) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 
[11:07:24] (03PS1) 10Jforrester: [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) [11:07:25] (03PS1) 10Jforrester: Drop ability to use VueTest on a wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) [11:07:35] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e0 [11:07:38] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e1 [11:08:48] (03CR) 10Volans: [C:04-1] "I think it can be simplified without repeating a lot of steps" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [11:09:58] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [11:10:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [11:11:55] (03CR) 10Volans: reimage: temporarily store the MAC in Netbox (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [11:12:12] (03CR) 10Volans: [C:03+2] Upstream release v0.6.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164152 (owner: 10Volans) [11:12:59] (03Merged) 10jenkins-bot: Upstream release v0.6.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164152 (owner: 10Volans) [11:14:04] (03PS1) 10Btullis: Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system [puppet] - 10https://gerrit.wikimedia.org/r/1164157 (https://phabricator.wikimedia.org/T397848) [11:14:08] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:15:11] (03PS1) 10Jcrespo: 
bacula: Create a temporary backup job for long term Archival [puppet] - 10https://gerrit.wikimedia.org/r/1164158 (https://phabricator.wikimedia.org/T387892) [11:15:29] (03CR) 10Ayounsi: reimage: temporarily store the MAC in Netbox (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [11:16:13] (03PS2) 10Ayounsi: reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 [11:18:37] !log uploaded debmonitor-server,python3-debmonitor_0.6.0 to apt.wikimedia.org bookworm-wikimedia [11:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:01] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet [11:19:38] (03PS37) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:20:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e1 [11:21:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e2 [11:21:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:22:29] (03PS1) 10Volans: Fix .wmfconfig settings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164162 [11:24:38] (03PS38) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:24:54] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet [11:25:24] (03CR) 10Volans: "Tested for the 0.6.0 release together with:" [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/1164162 (owner: 10Volans) [11:25:38] (03PS39) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:25:56] (03PS2) 10Jcrespo: bacula: Create a temporary backup job for long term Archival [puppet] - 10https://gerrit.wikimedia.org/r/1164158 (https://phabricator.wikimedia.org/T387892) [11:28:40] (03CR) 10Jforrester: "> {{Done}}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [11:28:52] (03CR) 10Vgutierrez: [C:03+2] liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez) [11:29:51] (03PS2) 10Muehlenhoff: Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) [11:31:54] (03CR) 10CI reject: [V:04-1] Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [11:32:45] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:32:54] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [11:33:17] (03CR) 10Muehlenhoff: Support passing multiple servers (035 comments) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [11:35:10] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10950029 (10Jdforrester-WMF) [11:36:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e2 [11:36:42] !log mvernon@cumin1002 
START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e3 [11:38:10] (03PS3) 10Jcrespo: bacula: Create a temporary backup job for long term Archival [puppet] - 10https://gerrit.wikimedia.org/r/1164158 (https://phabricator.wikimedia.org/T387892) [11:38:20] (03PS4) 10Jcrespo: bacula: Create a temporary backup job for long term Archival [puppet] - 10https://gerrit.wikimedia.org/r/1164158 (https://phabricator.wikimedia.org/T387892) [11:38:55] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [11:39:02] (03PS1) 10Esanders: Force-clear toolbar after teardown [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164165 (https://phabricator.wikimedia.org/T397914) [11:40:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164165 (https://phabricator.wikimedia.org/T397914) (owner: 10Esanders) [11:41:02] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:41:12] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:41:48] (03CR) 10Hnowlan: "lgtm, needs a chart bump but otherwise shipit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164154 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:43:51] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [11:44:02] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2019.codfw.wmnet [11:44:19] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [11:44:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and 
decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10950039 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of running VMs [11:44:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164168 [11:45:41] (03PS2) 10Jgiannelos: changeprop: Debug if-unmodified-since impact on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164154 (https://phabricator.wikimedia.org/T397750) [11:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:45:56] (03PS40) 10Slyngshede: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:46:22] !log ayounsi@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2007.codfw.wmnet with OS bookworm [11:46:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [11:51:00] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:51:31] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:51:36] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:51:38] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e3 [11:51:40] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [11:51:41] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e4 [11:52:29] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:53:08] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:54:43] jmm@cumin1003 drain-node (PID 3296372) is awaiting input [11:55:51] (03PS1) 10Slyngshede: data.yaml: Extend MOU for dhardy [puppet] - 10https://gerrit.wikimedia.org/r/1164169 [11:57:12] (03PS1) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) [11:57:18] (03CR) 10Edgar Allan Poe: "LGTM, time to fix this, in line with community needs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [11:57:44] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2007.codfw.wmnet with OS bookworm [11:58:35] !log ayounsi@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2007.codfw.wmnet with OS bookworm [12:00:02] (03PS1) 10Ladsgroup: tables catalog: Change visibility of wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/1164171 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1200) [12:00:17] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:00:45] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:00:49] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[12:00:50] (03PS3) 10Muehlenhoff: Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) [12:01:00] (03CR) 10Edgar Allan Poe: [C:03+1] "LGTM, time to fix this, in line with community needs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [12:01:27] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:02:44] (03CR) 10CI reject: [V:04-1] Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [12:05:27] (03CR) 10Jcrespo: [C:03+2] bacula: Create a temporary backup job for long term Archival [puppet] - 10https://gerrit.wikimedia.org/r/1164158 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:07:01] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e4 [12:07:04] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e5 [12:08:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [12:12:38] (03PS1) 10Alexandros Kosiaris: Remove old mw-wikifunctions RRs [dns] - 10https://gerrit.wikimedia.org/r/1164174 (https://phabricator.wikimedia.org/T384944) [12:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:13:58] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove old mw-wikifunctions RRs [dns] - 10https://gerrit.wikimedia.org/r/1164174 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:14:21] !log akosiaris@dns1004 START - running authdns-update 
[12:15:10] (03PS1) 10Esanders: Remove wgVisualEditorEditCheckSingleCheckMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 [12:15:23] !log akosiaris@dns1004 END - running authdns-update [12:15:32] (03PS1) 10Mvolz: Revert^2 "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 [12:15:39] (03PS1) 10Hnowlan: mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) [12:15:55] (03PS2) 10Mvolz: Revert^2 "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 [12:16:04] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10950109 (10Nahid) [12:17:47] (03PS3) 10Mvolz: Redo "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) [12:18:40] (03PS4) 10Mvolz: Redo "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) [12:18:42] (03PS2) 10Hnowlan: mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) [12:19:36] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [12:22:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e5 [12:22:31] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e6 [12:23:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2240.codfw.wmnet with reason: Maintenance [12:23:28] (03PS3) 10Hnowlan: mobileapps: remove CPU limits in prod, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 
(https://phabricator.wikimedia.org/T397750) [12:23:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T391056)', diff saved to https://phabricator.wikimedia.org/P78707 and previous config saved to /var/cache/conftool/dbconfig/20250626-122333-fceratto.json [12:23:40] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:23:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:24:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [12:25:01] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1164186 (owner: 10L10n-bot) [12:29:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T391056)', diff saved to https://phabricator.wikimedia.org/P78708 and previous config saved to /var/cache/conftool/dbconfig/20250626-122952-fceratto.json [12:30:00] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:32:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10950141 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of 
running VMs [12:34:51] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10950145 (10phaultfinder) [12:35:58] (03PS4) 10Hnowlan: mobileapps: remove CPU limits in prod, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) [12:37:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e6 [12:37:22] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e7 [12:39:32] (03PS4) 10Muehlenhoff: Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) [12:40:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164169 (owner: 10Slyngshede) [12:41:27] (03CR) 10Bartosz Dziewoński: [C:03+1] Force-clear toolbar after teardown [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164165 (https://phabricator.wikimedia.org/T397914) (owner: 10Esanders) [12:43:16] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [12:45:24] (03PS3) 10Ladsgroup: Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) [12:45:24] (03PS2) 10Ladsgroup: tables catalog: Change visibility of wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/1164171 [12:47:36] (03CR) 10Jgiannelos: [C:03+2] changeprop: Debug if-unmodified-since impact on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164154 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [12:47:40] (03PS3) 10Ladsgroup: tables catalog: Change visibility of wb_changes [puppet] - 
10https://gerrit.wikimedia.org/r/1164171 [12:47:48] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables catalog: Change visibility of wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/1164171 (owner: 10Ladsgroup) [12:49:41] (03Merged) 10jenkins-bot: changeprop: Debug if-unmodified-since impact on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164154 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [12:49:44] jouncebot: nowandnext [12:49:44] For the next 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1200) [12:49:44] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1300) [12:51:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164190 [12:52:15] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e7 [12:52:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e8 [12:52:38] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [12:52:48] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:53:25] (03PS41) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:53:36] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164190 (owner: 10PipelineBot) [12:54:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:55:19] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164190 (owner: 10PipelineBot) [12:55:35] !log jgiannelos@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/changeprop: apply [12:55:42] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:56:11] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:56:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:56:32] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:57:25] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:57:29] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:57:31] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:57:33] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:57:39] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1300). [13:00:05] _Gerges, effie, Aca, and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] <_Gerges> Here [13:00:08] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:00:15] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:00:38] (03CR) 10Ssingh: [C:03+1] "Thanks for taking care of beta as well! Nice cleanup." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:01:00] o/ [13:01:02] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [13:01:02] * TheresNoTime can't deploy today [13:01:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:01:51] I can deploy! [13:01:55] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:02:08] (03CR) 10FNegri: "Is wikilove_log the only table that is missing from the catalog but still exists in one or more wikis? Does it exist in clouddbs only or i" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:03:01] (03CR) 10Ssingh: [V:03+1] "This is ready for review. I will incrementally add the other bits but of course this is the base change." [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:03:32] (03CR) 10Alexandros Kosiaris: [C:03+2] "Cruft I should have deleted back then. I 've also found some more today in Ib8b4dde380c93cf4e4275d255e5c6a46dfaae969. 
Thanks for catching " [puppet] - 10https://gerrit.wikimedia.org/r/1163856 (https://phabricator.wikimedia.org/T384944) (owner: 10Scott French) [13:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:05:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e8 [13:05:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.e9 [13:06:20] o/ I can self deploy [13:07:05] *waves* [13:07:16] <_Gerges> Lucas_WMDE: When you start my patch tag me [13:07:33] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "IMHO Mohanad on Phabricator is a bit confused about the requirements for a config change, but nevertheless this seems okay to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164145 (https://phabricator.wikimedia.org/T397845) (owner: 10GergesShamon) [13:08:05] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] debug.json: remove mwdebugX hosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164148 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:08:12] Lucas_WMDE: I have a high/UBN for VE to deploy [13:08:13] (03CR) 10Milimetric: [C:03+1] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:08:31] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:35] edsanders: okay, then go ahead [13:08:52] _Gerges: sounds like edsanders is going first, then I’ll deploy your change [13:09:09] Thanks all [13:09:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Seems reasonable and in line with the other configs here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [13:09:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164165 (https://phabricator.wikimedia.org/T397914) (owner: 10Esanders) [13:09:45] we can probably deploy the config changes for _Gerges, effie and Aca together, they all seem pretty low-risk to me (and separate enough that we shouldn’t have trouble telling their effects apart from one another) [13:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:11:04] (03Merged) 10jenkins-bot: Force-clear toolbar after teardown [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164165 (https://phabricator.wikimedia.org/T397914) (owner: 10Esanders) [13:11:10] sure thing, ack [13:11:30] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1164165|Force-clear toolbar after teardown (T397914)]] [13:11:36] T397914: VE failing to load a second time after saving - https://phabricator.wikimedia.org/T397914 [13:13:47] !log esanders@deploy1003 esanders: Backport for [[gerrit:1164165|Force-clear toolbar after teardown (T397914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:15:49] !log esanders@deploy1003 esanders: Continuing with sync [13:16:11] (03CR) 10Muehlenhoff: Support passing multiple servers (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [13:16:18] (03CR) 10Muehlenhoff: [C:03+2] Support passing multiple servers [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164153 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [13:16:37] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:18:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:19:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.e9 [13:19:34] (03PS1) 10Lucas Werkmeister (WMDE): Empty change to test scap Depends-On handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164199 [13:19:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ea [13:20:01] any objections to me trying out ^ that deployment at the end of this window? 
[13:20:33] CC tgr|away who reported that the Depends-On check didn’t work properly in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1153626 [13:20:54] (I’ve had some colleagues interested in the feature so I want to check that it still works and screenshot how it looks like ^^) [13:21:04] (03PS2) 10Ssingh: nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 [13:21:22] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164165|Force-clear toolbar after teardown (T397914)]] (duration: 09m 52s) [13:21:30] (03CR) 10CI reject: [V:04-1] nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [13:21:31] T397914: VE failing to load a second time after saving - https://phabricator.wikimedia.org/T397914 [13:21:54] (03PS5) 10Cmelo: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) [13:22:53] (03CR) 10CI reject: [V:04-1] Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [13:23:19] edsanders: all done? 
[13:23:30] yup [13:23:36] alright, thanks [13:23:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164145 (https://phabricator.wikimedia.org/T397845) (owner: 10GergesShamon) [13:23:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164148 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:23:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [13:23:51] _Gerges, effie, Aca: deploying your changes now [13:23:53] (03PS2) 10Vgutierrez: cacheproxy: Report CPUs assigned to NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) [13:23:58] (03PS2) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) [13:24:01] cheers [13:24:02] ack [13:24:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164199 (owner: 10Lucas Werkmeister (WMDE)) [13:24:28] (03Merged) 10jenkins-bot: [arwikiversity] fix wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164145 (https://phabricator.wikimedia.org/T397845) (owner: 10GergesShamon) [13:24:42] (03Merged) 10jenkins-bot: debug.json: remove mwdebugX hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164148 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:24:45] (03Merged) 10jenkins-bot: Disable translations in sh-latn and sh-cyrl 
(wgTranslateDisabledTargetLanguages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164146 (https://phabricator.wikimedia.org/T397913) (owner: 10Acamicamacaraca) [13:25:08] (03CR) 10Vgutierrez: [C:03+2] cacheproxy: Report CPUs assigned to NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [13:25:10] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1164145|[arwikiversity] fix wordmark (T397845)]], [[gerrit:1164148|debug.json: remove mwdebugX hosts (T397498)]], [[gerrit:1164146|Disable translations in sh-latn and sh-cyrl (wgTranslateDisabledTargetLanguages) (T397913)]] [13:25:18] T397845: Fix Arabic wikiversity wordmark - https://phabricator.wikimedia.org/T397845 [13:25:19] T397498: Deprecate mwdebugXXXX hosts - https://phabricator.wikimedia.org/T397498 [13:25:19] T397913: Disable translations in Serbo-Croatian Latin and Cyrillic scripts ($wgTranslateDisabledTargetLanguages) - https://phabricator.wikimedia.org/T397913 [13:27:01] (03CR) 10Andrew Bogott: keystone policy: allow object_storage role to create/delete ec2 creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [13:27:26] !log lucaswerkmeister-wmde@deploy1003 jiji, aleksandar, gergesshamon, lucaswerkmeister-wmde: Backport for [[gerrit:1164145|[arwikiversity] fix wordmark (T397845)]], [[gerrit:1164148|debug.json: remove mwdebugX hosts (T397498)]], [[gerrit:1164146|Disable translations in sh-latn and sh-cyrl (wgTranslateDisabledTargetLanguages) (T397913)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:28:06] checkin' [13:28:20] (03PS6) 10Cmelo: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) [13:29:00] (03CR) 10JHathaway: [C:03+1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [13:29:24] (03CR) 10Majavah: [C:04-1] keystone policy: allow object_storage role to create/delete ec2 creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [13:29:29] _Gerges, effie: please test as well :) [13:29:44] grand, hang on [13:29:59] <_Gerges> If possible, empty the arwikiversity server cache. [13:30:01] LGTM, script translations are now disabled, as expected. Main language translation is working. [13:30:05] the mwdebug removal seems to work for me [13:30:32] (I’m still weirded out by the fact that debug.json changes seem to take effect before they’ve been synced everywhere) [13:31:13] _Gerges: I’m not sure that’s necessary? I was able to see a change with Ctrl+F5 [13:31:36] I can purge wikiversity-wordmark-ar.svg after the deployment, but doing it now wouldn’t make a difference, I think [13:31:46] Lucas_WMDE: LGTM too [13:32:22] (Checked for Meta, and Media-Wiki and WMF Governance Wiki.) 
[13:32:23] <_Gerges> Ok [13:32:31] (03PS1) 10Volans: base template: fix CSS/JS includes [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) [13:32:37] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.ea [13:32:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.eb [13:33:19] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10950374 (10Fabfur) Key confirmed also on separate channel. [13:34:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:34:03] (03PS1) 10Eevans: sessionstore200[45]: (re)reimage for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1164203 (https://phabricator.wikimedia.org/T390514) [13:34:04] (03PS1) 10Eevans: sessionstore2004: updated data_file_directories set [puppet] - 10https://gerrit.wikimedia.org/r/1164204 (https://phabricator.wikimedia.org/T390514) [13:34:04] (03PS1) 10Eevans: sessionstore2005: updated data_file_directories set [puppet] - 10https://gerrit.wikimedia.org/r/1164205 (https://phabricator.wikimedia.org/T390514) [13:34:18] _Gerges: I’m not sure what that Ok means ^^ are you still testing or does it look okay? 
[13:34:37] (03CR) 10CDanis: [C:03+1] base template: fix CSS/JS includes [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:34:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [13:35:08] (03PS42) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [13:35:14] !log manual clean up of update-ocsp.d leftovers in cp hosts [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:36:37] (03CR) 10Vgutierrez: [C:04-1] cache,haproxy: set requestctl in x-analytics if not set by varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:37:33] (03CR) 10Xcollazo: [C:03+1] "So much work to get to this point! Thanks Ben and Balthazar!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164150 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis) [13:37:38] (03Abandoned) 10Effie Mouzeli: data.yaml: allow deployers to restart php8.1-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1145845 (owner: 10Effie Mouzeli) [13:38:22] guess I’ll go ahead with the deployment then [13:38:24] !log lucaswerkmeister-wmde@deploy1003 jiji, aleksandar, gergesshamon, lucaswerkmeister-wmde: Continuing with sync [13:38:33] (03CR) 10Xcollazo: [C:03+1] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system [puppet] - 10https://gerrit.wikimedia.org/r/1164157 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis) [13:38:34] (03PS1) 10Fabfur: data: new key for klevan@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1164206 (https://phabricator.wikimedia.org/T397832) [13:38:39] (03PS1) 10Effie Mouzeli: trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) [13:38:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Prepare db2240 for pool-in', diff saved to https://phabricator.wikimedia.org/P78710 and previous config saved to /var/cache/conftool/dbconfig/20250626-133845-fceratto.json [13:39:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2240 gradually with 4 steps - Pooling in [13:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:41:01] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [13:41:04] (03CR) 10CI reject: [V:04-1] trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 
10Effie Mouzeli) [13:42:06] <_Gerges> everything is fine [13:42:20] (03CR) 10Eevans: [C:03+2] sessionstore200[45]: (re)reimage for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1164203 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [13:42:24] (03PS3) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) [13:42:30] (03PS2) 10Volans: base template: fix CSS/JS includes [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) [13:42:56] (03CR) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:43:05] (03PS2) 10Ssingh: hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837 [13:43:32] (03PS2) 10Effie Mouzeli: trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) [13:44:12] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164145|[arwikiversity] fix wordmark (T397845)]], [[gerrit:1164148|debug.json: remove mwdebugX hosts (T397498)]], [[gerrit:1164146|Disable translations in sh-latn and sh-cyrl (wgTranslateDisabledTargetLanguages) (T397913)]] (duration: 19m 02s) [13:44:16] (03PS3) 10Ssingh: hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837 [13:44:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1164206 (https://phabricator.wikimedia.org/T397832) (owner: 10Fabfur) [13:44:23] T397845: Fix Arabic wikiversity wordmark - https://phabricator.wikimedia.org/T397845 [13:44:24] T397498: Deprecate mwdebugXXXX hosts - https://phabricator.wikimedia.org/T397498 [13:44:24] T397913: Disable 
translations in Serbo-Croatian Latin and Cyrillic scripts ($wgTranslateDisabledTargetLanguages) - https://phabricator.wikimedia.org/T397913 [13:45:21] <_Gerges> Everything is now live? [13:45:32] !log decommissioning Cassandra/sessionstore2004-a — T391544 [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:37] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [13:45:45] !log lucaswerkmeister-wmde@deploy1003 $ echo https://en.wikipedia.org/static/images/mobile/copyright/wikiversity-wordmark-ar.svg | mwscript-k8s --attach --comment=T397845 -- purgeList enwiki [13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] (03CR) 10CI reject: [V:04-1] trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:46:08] Thanks for the deploy! :) [13:46:13] and now I’ll go ahead and test the Depends-On stuff [13:46:16] _Gerges: should be now, yes [13:46:18] Aca: yw :) [13:46:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164199 (owner: 10Lucas Werkmeister (WMDE)) [13:46:28] <_Gerges> Thanks:) [13:46:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.eb [13:46:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ec [13:47:32] (03CR) 10Vgutierrez: [C:03+1] cache,haproxy: set requestctl in x-analytics if not set by varnish (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:48:01] (03Merged) 10jenkins-bot: Empty change to test scap Depends-On handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164199 
(owner: 10Lucas Werkmeister (WMDE)) [13:48:04] (03CR) 10Fabfur: [C:03+2] data: new key for klevan@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1164206 (https://phabricator.wikimedia.org/T397832) (owner: 10Fabfur) [13:48:26] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1164199|Empty change to test scap Depends-On handling]] [13:48:32] hmmmmm [13:48:37] yeha that’s not supposed to happen [13:48:40] (03CR) 10Elukey: [C:03+1] Fix .wmfconfig settings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164162 (owner: 10Volans) [13:48:41] *yeah [13:48:51] guess I’ll file a task [13:49:09] and… just let this deploy go through, I guess. I think aborting it would be worse [13:49:33] FIRING: ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:39] (03PS3) 10Effie Mouzeli: trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) [13:49:48] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10950476 (10Fabfur) Change should be propagated shortly, please confirm if this can be marked as resolved [13:49:59] PROBLEM - SSH on cloudrabbit2003-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1164199|Empty change to test scap Depends-On handling]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:50:49] RECOVERY - SSH on cloudrabbit2003-dev is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:18] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [13:51:29] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:51:30] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... [13:52:03] (03CR) 10CI reject: [V:04-1] trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:52:12] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6083/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh) [13:52:39] filed T397931 [13:52:39] T397931: scap not complaining about dependencies only partially deployed with the train - https://phabricator.wikimedia.org/T397931 [13:52:54] (03PS4) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) [13:52:56] (03PS1) 10Volans: debian: fix links to bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164212 (https://phabricator.wikimedia.org/T397696) [13:53:08] (03CR) 10Lucas Werkmeister (WMDE): "Scap didn’t warn about the change 😱" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164199 (owner: 10Lucas Werkmeister (WMDE)) [13:53:24] (03CR) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:53:31] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:43] PROBLEM - SSH on cloudrabbit2002-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:59] (03PS4) 10Effie Mouzeli: trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) [13:54:09] (03CR) 10CDanis: [C:03+1] debian: fix links to bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164212 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:54:33] RECOVERY - SSH on cloudrabbit2002-dev is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:33] (03PS2) 10Andrew Bogott: keystone policy: allow object_storage role to create/delete ec2 creds [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) [13:55:34] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [13:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:56:02] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [13:56:30] (03CR) 10CI reject: [V:04-1] trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:56:46] (03CR) 10Volans: [C:03+2] Fix 
.wmfconfig settings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164162 (owner: 10Volans) [13:56:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [13:57:00] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164199|Empty change to test scap Depends-On handling]] (duration: 08m 33s) [13:57:23] (03CR) 10Volans: [C:03+2] debian: fix links to bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164212 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:57:29] (I could revert that change but I don’t think it’s needed, might as well keep that harmless colon in a temporary comment) [13:57:33] !log UTC afternoon backport+config window done [13:57:37] (03Merged) 10jenkins-bot: Fix .wmfconfig settings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164162 (owner: 10Volans) [13:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] (03Merged) 10jenkins-bot: debian: fix links to bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164212 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:58:33] (03PS1) 10Muehlenhoff: Move docker-report from build2001 to build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) [13:58:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164212 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:59:23] !log akosiaris@cumin1003 START - Cookbook sre.dns.netbox [13:59:30] PROBLEM - Host cloudrabbit2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [14:01:00] RECOVERY - Host cloudrabbit2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [14:01:01] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
sessionstore2004.codfw.wmnet with OS bullseye [14:01:11] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... [14:01:44] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ec [14:01:47] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ed [14:01:54] (03CR) 10Slyngshede: [C:03+2] data.yaml: Extend MOU for dhardy [puppet] - 10https://gerrit.wikimedia.org/r/1164169 (owner: 10Slyngshede) [14:02:08] (03CR) 10Brouberol: [C:03+1] "woohoo" [puppet] - 10https://gerrit.wikimedia.org/r/1164150 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis) [14:02:14] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [14:02:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [14:02:22] (03CR) 10Brouberol: [C:03+1] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system [puppet] - 10https://gerrit.wikimedia.org/r/1164157 (https://phabricator.wikimedia.org/T397848) (owner: 10Btullis) [14:02:23] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [14:02:43] (03CR) 10Volans: [C:03+2] base template: fix CSS/JS includes [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:02:44] (03PS1) 10Jelto: devtools: update hiera config for new bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/1164222 (https://phabricator.wikimedia.org/T396622) [14:03:02] !log 
jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [14:03:05] (03PS5) 10Hnowlan: mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) [14:04:32] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [14:04:39] (03CR) 10Ssingh: [V:03+1] "Rebased, ran PCC, no code change." [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh) [14:04:43] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... [14:04:52] (03PS6) 10Hnowlan: mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) [14:04:53] akosiaris@cumin1003 netbox (PID 3311314) is awaiting input [14:05:10] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:13] !log sudo cumin 'A:cp' "disable-puppet 'merging CR 1163837'" [14:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:20] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:55] (03CR) 10Jelto: [C:03+2] devtools: update hiera config for new bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/1164222 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [14:06:20] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh) [14:06:39] FIRING: [4x] CoreBGPDown: Core BGP session down between 
cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:07:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:07:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:58] !log root@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [14:09:22] (03PS43) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:09:30] (03CR) 10Hnowlan: [C:03+2] mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [14:11:21] (03Merged) 10jenkins-bot: mobileapps: remove CPU limits in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164180 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [14:11:40] !log root@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [14:12:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:33] jouncebot: nowandnext [14:12:33] No deployments scheduled for the 
next 0 hour(s) and 17 minute(s) [14:12:33] In 0 hour(s) and 17 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1430) [14:13:12] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:13:18] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:13:30] (03CR) 10Zabe: [C:03+2] beta: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164141 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [14:13:50] !log sudo cumin -b11 'A:cp' "run-puppet-agent --enable 'merging CR 1163837'" [14:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:14:23] (03Merged) 10jenkins-bot: beta: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164141 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [14:14:54] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [14:14:57] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:16:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:16:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[14:16:44] Deployment mw-experimental.eqiad.pinkllama in mw-experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-experimental&var-deployment=mw-experimental.eqiad.pinkllama - ... [14:16:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:16:49] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [14:16:53] (03Merged) 10jenkins-bot: base template: fix CSS/JS includes [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1164202 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:17:08] (03PS44) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:17:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:17:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ed [14:17:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ee [14:17:26] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [14:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:18:31] RESOLVED: [2x] 
ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:33] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [14:18:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:19:45] !log dancy@deploy1003 Installing scap version "4.183.0" for 2 host(s) [14:20:12] (03PS6) 10Slyngshede: Add new Netbox records repo [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) [14:20:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [14:20:20] (03PS45) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:21:00] (03PS5) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [14:21:35] !log dancy@deploy1003 Installation of scap version "4.183.0" completed for 2 hosts [14:22:14] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [14:22:17] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: remove for decom [14:23:09] (03CR) 10AOkoth: os_updates: manage stylesheet with puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via 
main at eqiad: 20.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:23:16] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [14:23:44] (03PS3) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [14:24:11] (03CR) 10CI reject: [V:04-1] os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:24:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10950726 (10MoritzMuehlenhoff) [14:24:23] !log restart memcached on mc2038 and mc2039 [14:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:32] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2240 gradually with 4 steps - Pooling in [14:24:37] (03PS4) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [14:25:03] (03CR) 10CI reject: [V:04-1] os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:25:16] (03PS5) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [14:25:20] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet [14:27:07] (03PS2) 10Ssingh: P:cache::haproxy: properly 
indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842 [14:27:33] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2004.codfw.wmnet with OS bullseye [14:27:45] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... [14:27:52] jouncebot nowandnext [14:27:52] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [14:27:52] In 0 hour(s) and 2 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1430) [14:28:31] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [14:28:36] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6084/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163842 (owner: 10Ssingh) [14:29:41] (03PS3) 10Ssingh: P:cache::haproxy: properly indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842 [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1430) [14:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:30:56] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6085/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163842 (owner: 10Ssingh) [14:31:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet [14:31:55] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [14:32:08] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... [14:32:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ee [14:32:59] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ef [14:33:51] (03CR) 10Ssingh: [V:03+1] "Rebased,NOOP on two cp hosts so should be NOOP for all others given the profile is applied." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163842 (owner: 10Ssingh) [14:34:49] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cache::haproxy: properly indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842 (owner: 10Ssingh) [14:36:12] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:36:18] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:36:26] (03PS1) 10Hnowlan: mobileapps: revert to original worker count, restore CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164227 (https://phabricator.wikimedia.org/T397750) [14:37:26] (03PS1) 10Muehlenhoff: Record LDAP access for amarkossian [puppet] - 10https://gerrit.wikimedia.org/r/1164228 [14:37:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:37:59] (03CR) 10Clément Goubert: [C:03+1] mobileapps: revert to original worker count, restore CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164227 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [14:38:02] (03PS3) 10Ssingh: nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 [14:38:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:39:44] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for amarkossian [puppet] - 
10https://gerrit.wikimedia.org/r/1164228 (owner: 10Muehlenhoff) [14:40:30] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6086/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [14:41:01] (03PS4) 10Ssingh: nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 [14:42:10] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2004.codfw.wmnet with OS bullseye [14:42:20] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... [14:42:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:42:50] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [14:43:04] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10950887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... 
[14:43:59] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6087/" [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [14:45:12] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:45:18] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:45:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:47:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:47:49] !log dancy@deploy1003 Installing scap version "4.182.0" for 2 host(s) [14:48:05] (03CR) 10Vgutierrez: [C:03+1] nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [14:48:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:48:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ef [14:48:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f0 [14:49:28] !log dancy@deploy1003 Installation of scap version "4.182.0" completed for 2 hosts 
[14:49:37] (03PS1) 10Cwhite: validator: amend get_type to handle nulls [software/ecs] - 10https://gerrit.wikimedia.org/r/1164231 (https://phabricator.wikimedia.org/T234565) [14:50:24] (03CR) 10Cwhite: [C:03+2] validator: amend get_type to handle nulls [software/ecs] - 10https://gerrit.wikimedia.org/r/1164231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:50:28] (03PS1) 10Volans: Upstream release v0.6.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164232 [14:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:50:57] (03CR) 10Volans: [C:03+2] Upstream release v0.6.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164232 (owner: 10Volans) [14:50:57] (03Merged) 10jenkins-bot: validator: amend get_type to handle nulls [software/ecs] - 10https://gerrit.wikimedia.org/r/1164231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:52:09] (03Merged) 10jenkins-bot: Upstream release v0.6.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1164232 (owner: 10Volans) [14:54:39] (03CR) 10Cathal Mooney: "LGTM but I will let the traffic folk weigh in" [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [14:55:06] !log uploaded debmonitor-server,python3-debmonitor_0.6.1 to apt.wikimedia.org bookworm-wikimedia [14:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:56:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:57:21] (03PS1) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) [14:58:37] (03CR) 10CDanis: [C:03+1] cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [14:58:56] (03CR) 10Fabfur: [C:03+2] cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [14:59:35] (03CR) 10Eevans: [C:03+2] sessionstore2004: updated data_file_directories set [puppet] - 10https://gerrit.wikimedia.org/r/1164204 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [15:00:05] jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1500) [15:00:27] (03CR) 10Hnowlan: [C:03+2] mobileapps: revert to original worker count, restore CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164227 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [15:01:56] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [15:02:13] (03Merged) 10jenkins-bot: mobileapps: revert to original worker count, restore CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164227 
(https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [15:02:35] (03CR) 10Ssingh: [C:03+1] Add new Netbox records repo [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [15:03:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f0 [15:03:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f1 [15:03:58] jouncebot: nowandnext [15:03:58] For the next 0 hour(s) and 56 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1500) [15:03:58] In 0 hour(s) and 56 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1600) [15:04:19] (03CR) 10Elukey: [C:03+1] "LGTM, but please add a few words in the commit msg about why we are doing it, and add Matthew in CC so he's aware :)" [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:04:23] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:04:44] (03CR) 10Elukey: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:05:09] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:05:22] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [15:05:25] (03PS6) 10Fabfur: cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) [15:05:28] (03PS2) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) 
[15:05:36] (03CR) 10Elukey: hiera/thanos-swift: Fix MinT user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:06:00] (03CR) 10CI reject: [V:04-1] hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:06:02] (03PS2) 10Scott French: deployment_server: use bookworm httpd in all mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/1164236 (https://phabricator.wikimedia.org/T378128) [15:06:36] (03PS1) 10Vgutierrez: hiera: Use the upload cert on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) [15:06:40] !log akosiaris@cumin1003 START - Cookbook sre.dns.netbox [15:06:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:06:58] (03PS3) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) [15:07:00] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [15:07:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:08:13] (03CR) 10Volans: [C:03+1] "Looks reasonable, ofc to be tested. I didn't spot any evident issue." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [15:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:08:38] (03CR) 10Hnowlan: [C:03+1] deployment_server: use bookworm httpd in all mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/1164236 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:09:03] (03CR) 10Elukey: "Matthew: should we use mlserve:ro, rather than machinetranslation:prod? We'd need to have visibility on the buckets created by the ML acco" [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:09:17] (03PS4) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) [15:09:40] (03CR) 10Klausman: "Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [15:09:58] !log akosiaris@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removal of old mw-wikifunctions PTR records - akosiaris@cumin1003" [15:12:05] (03CR) 10Fabfur: [C:03+2] cache,haproxy: set requestctl in x-analytics if not set by varnish [puppet] - 10https://gerrit.wikimedia.org/r/1164170 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [15:12:47] !log temporary disable puppet on cp7001 (T397917) [15:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:54] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917 [15:13:02] akosiaris@cumin1003 netbox (PID 3319343) is awaiting input [15:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:13:24] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.5.0 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164240 [15:13:38] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.5.0 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164240 (owner: 10Volans) [15:14:37] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [15:15:42] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removal of old mw-wikifunctions PTR records - akosiaris@cumin1003" [15:15:42] !log 
akosiaris@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:45] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:17:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f1 [15:17:13] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f2 [15:18:31] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:39] (03PS2) 10Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [15:18:47] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.5.0 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1164240 (owner: 10Volans) [15:19:25] (03PS1) 10Fabfur: haproxy: dummy patch to fix mistake on puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1164241 [15:19:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:35] (03PS3) 10Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [15:19:40] (03CR) 10CI reject: [V:04-1] temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [15:19:52] (03PS4) 10Kosta Harlan: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [15:20:45] (03PS1) 
10Scott French: Revert "mw-(api-ext|web): pilot 5% of traffic on new httpd images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164242 (https://phabricator.wikimedia.org/T378128) [15:20:45] (03CR) 10CI reject: [V:04-1] temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (https://phabricator.wikimedia.org/T397940) (owner: 10Tchanders) [15:22:04] (03CR) 10Fabfur: [C:03+2] haproxy: dummy patch to fix mistake on puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1164241 (owner: 10Fabfur) [15:22:13] (03PS1) 10Volans: Upstream release v0.5.0 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1164243 [15:22:33] (03CR) 10Volans: [C:03+2] Upstream release v0.5.0 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1164243 (owner: 10Volans) [15:24:31] (03Merged) 10jenkins-bot: Upstream release v0.5.0 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1164243 (owner: 10Volans) [15:26:14] (03CR) 10Ssingh: [C:03+1] "CR looks good and so does the idea as discussed." [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:27:38] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2004.codfw.wmnet with OS bullseye [15:27:53] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10951069 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... 
[15:29:43] !log sudo cumin 'A:cp' "disable-puppet 'merging CR 1163843'" [15:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f2 [15:30:03] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f3 [15:31:20] (03CR) 10Ssingh: [V:03+1 C:03+2] nagios_common and P:cache::haproxy: s/ats/cdn for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 (owner: 10Ssingh) [15:31:41] !log bootstrapping Cassandra/sessionstore2004-a — T390514 [15:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] (03PS3) 10Scott French: Remove title-case overrides for PHP 8.1 migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152295 (https://phabricator.wikimedia.org/T394556) [15:33:31] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:50] (03PS6) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [15:34:13] (03PS1) 10Cwhite: validator: keyword type can be an array of strings [software/ecs] - 10https://gerrit.wikimedia.org/r/1164245 (https://phabricator.wikimedia.org/T234565) [15:34:29] !log repooling cp7001 (T397917) [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:35] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917 [15:34:37] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [15:36:40] 10ops-codfw, 06SRE, 
06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#10951113 (10Jhancock.wm) a:03Jhancock.wm [15:37:22] updating the name of a nagios check command. being careful so there should be no alert spam. will silence in case there is but nothing to worry about. [15:37:25] ocsp* [15:39:29] PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [15:39:41] ^ ok looking [15:39:47] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10951133 (10jcrespo) I want to give you an (non-)update that I haven't forgotten about this- sadly, there was a need to do some bas... [15:40:02] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [15:40:03] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@f7c98a3]: Deploying artifacts to aiflow_dags/analytics_test [15:40:20] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@f7c98a3]: Deploying artifacts to aiflow_dags/analytics_test (duration: 00m 16s) [15:40:27] !log uploaded debmonitor-client_0.5.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [15:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] (03PS6) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [15:42:55] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [15:43:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f3 [15:44:01] !log mvernon@cumin1002 START - Cookbook 
sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f4 [15:44:40] Icinga config alerts are expected, nothing to worry about. clearing them out soon. [15:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:47:49] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:18] (03PS1) 10Cwhite: validator: fix get_type to correctly handle ints [software/ecs] - 10https://gerrit.wikimedia.org/r/1164251 (https://phabricator.wikimedia.org/T234565) [15:48:32] (03CR) 10Cwhite: [C:03+2] validator: keyword type can be an array of strings [software/ecs] - 10https://gerrit.wikimedia.org/r/1164245 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:49:05] (03Merged) 10jenkins-bot: validator: keyword type can be an array of strings [software/ecs] - 10https://gerrit.wikimedia.org/r/1164245 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:51:09] (03PS1) 10Fabfur: hiera: set requestctl in x-analytics if not set by varnish (cp7006) [puppet] - 10https://gerrit.wikimedia.org/r/1164253 (https://phabricator.wikimedia.org/T397917) [15:52:05] (03PS1) 10Volans: debmonitor-next: fix envoy setup [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) [15:52:08] (03CR) 10Vgutierrez: [C:03+1] hiera: set requestctl in x-analytics if not set by varnish (cp7006) [puppet] - 10https://gerrit.wikimedia.org/r/1164253 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [15:52:13] !log sudo cumin -b11 'A:cp' "run-puppet-agent --enable 'merging CR 1163843'" [15:52:16] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [15:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:35] (03CR) 10Fabfur: [C:03+2] hiera: set 
requestctl in x-analytics if not set by varnish (cp7006) [puppet] - 10https://gerrit.wikimedia.org/r/1164253 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [15:53:07] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [15:55:25] FIRING: SystemdUnitFailed: isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:04] jhathaway and moritzm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1600). [16:00:04] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:07] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f4 [16:00:11] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f5 [16:00:36] (03PS1) 10Giuseppe Lavagetto: release new version of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1164260 [16:00:51] o/ [16:00:58] o/ [16:02:33] dancy: shall I merge both patches [16:02:40] Yes please [16:02:53] (03CR) 10JHathaway: [C:03+2] logspam.pl: Consolidate ThreadRevision unserialize() errors [puppet] - 10https://gerrit.wikimedia.org/r/1163833 (https://phabricator.wikimedia.org/T259111) (owner: 10Ahmon Dancy) [16:02:58] (03CR) 10JHathaway: [C:03+2] scap.cfg.erb: Drop unused php_fpm* config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [16:03:28] dancy: rolling them out [16:03:37] (03PS2) 10Volans: debmonitor-next: fix internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) [16:03:38] Thanks! 
[16:04:33] (03CR) 10CDanis: [C:03+1] debmonitor-next: fix internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:05:24] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:05:25] RESOLVED: SystemdUnitFailed: isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:27] (03PS7) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:06:03] (03PS2) 10Alexandros Kosiaris: mesh: Support retry_policy for upstream cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) [16:06:13] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155125 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:07:33] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:07:41] (03PS2) 10Cwhite: validator: fix get_type to correctly handle ints [software/ecs] - 10https://gerrit.wikimedia.org/r/1164251 (https://phabricator.wikimedia.org/T234565) [16:07:44] (03Merged) 10jenkins-bot: mesh: Add configuration_1.14 (copy/paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155125 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:07:55] (03CR) 10Volans: [C:03+2] debmonitor-next: fix internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1164254 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:08:31] RESOLVED: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:22] (03CR) 10Elukey: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:09:43] (03CR) 10AOkoth: os_updates: manage stylesheet with puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [16:10:17] (03PS2) 10Giuseppe Lavagetto: release new version of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1164260 [16:10:29] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] release new version of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1164260 (owner: 10Giuseppe 
Lavagetto) [16:10:49] (03CR) 10Cwhite: [C:03+2] validator: fix get_type to correctly handle ints [software/ecs] - 10https://gerrit.wikimedia.org/r/1164251 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:11:01] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes; code refactoring - oblivian@cumin1003" [16:11:03] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes; code refactoring - oblivian@cumin1003 [16:11:13] (03Merged) 10jenkins-bot: validator: fix get_type to correctly handle ints [software/ecs] - 10https://gerrit.wikimedia.org/r/1164251 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:11:38] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes; code refactoring - oblivian@cumin1003 [16:11:39] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes; code refactoring - oblivian@cumin1003" [16:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:13:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f5 [16:13:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f6 [16:14:07] Icinga alert should clear up now [16:14:13] RECOVERY - Check correctness of the icinga configuration on alert1002 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [16:16:00] (03PS3) 10Alexandros Kosiaris: mesh: Support retry_policy for upstream cluster [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) [16:18:45] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [16:19:02] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2023 / ganeti2024 - https://phabricator.wikimedia.org/T397311#10951382 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:19:25] (03PS1) 10Volans: sretest: report to both debmonitor servers [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) [16:19:34] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:20:20] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:21:19] (03PS2) 10Volans: sretest: report to both debmonitor servers [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) [16:21:31] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:22:45] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:24:16] (03Merged) 10jenkins-bot: mesh: Support retry_policy for upstream cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:24:32] (03PS3) 10Volans: sretest: report to both debmonitor servers [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) [16:25:08] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:26:53] (03CR) 10Vgutierrez: [C:04-2] "to be merged on 2025-06-30" [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:28:43] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951416 (10Jhancock.wm) @Andrew this has been unracked and disks removed, but ran into an error running the offline script in netbox. lo... 
[16:28:54] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951417 (10Jhancock.wm) [16:30:16] !log depool cp7006 for a quick test (T397917) [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:22] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917 [16:30:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f6 [16:30:42] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f7 [16:31:19] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7006.magru.wmnet [16:33:01] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet [16:33:21] !log decommissioning Cassandra/sessionstore2005-a — T390514 [16:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:36] (03CR) 10Volans: "This of course fails because of:" [puppet] - 10https://gerrit.wikimedia.org/r/1164262 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:34:46] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10951430 (10Jhancock.wm) fwiw, idrac is currently reachable on this one. not sure what changed when. 
[16:34:53] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7006.magru.wmnet [16:35:45] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye [16:35:59] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10951435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2005.c... [16:38:31] FIRING: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:43] (03PS1) 10Alexandros Kosiaris: mesh: Bump mesh.configuration requirement to 1.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164268 (https://phabricator.wikimedia.org/T380958) [16:38:46] (03PS1) 10Alexandros Kosiaris: mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) [16:38:50] (03PS1) 10Alexandros Kosiaris: mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) [16:39:16] (03PS1) 10AikoChou: ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164271 (https://phabricator.wikimedia.org/T397013) [16:39:39] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm [16:40:17] (03CR) 10CI reject: [V:04-1] mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:40:24] (03CR) 10CI reject: [V:04-1] mesh: Bump mesh.configuration requirement 
to 1.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164268 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:40:45] (03CR) 10CI reject: [V:04-1] mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:42:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:42:58] (03CR) 10Eevans: [C:03+2] sessionstore2005: updated data_file_directories set [puppet] - 10https://gerrit.wikimedia.org/r/1164205 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [16:43:10] !log mnz@deploy1003 Started deploy [airflow-dags/research@19c55cd]: (no justification provided) [16:43:52] !log mnz@deploy1003 Finished deploy [airflow-dags/research@19c55cd]: (no justification provided) (duration: 00m 48s) [16:44:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f7 [16:44:43] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f8 [16:44:51] (03PS2) 10Alexandros Kosiaris: mesh: Bump mesh.configuration requirement to 1.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164268 (https://phabricator.wikimedia.org/T380958) [16:44:51] (03PS2) 10Alexandros Kosiaris: mediawiki: Bump mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164269 (https://phabricator.wikimedia.org/T380958) [16:44:51] (03PS2) 10Alexandros Kosiaris: mw-debug: Specify upstream_retry_policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164270 (https://phabricator.wikimedia.org/T380958) [16:44:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [16:48:03] jhancock@cumin1003 provision (PID 3330319) is awaiting input [16:48:16] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh: Bump mesh.configuration requirement to 1.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164268 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:49:24] (03PS1) 10Btullis: Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) [16:49:41] (03Merged) 10jenkins-bot: mesh: Bump mesh.configuration requirement to 1.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164268 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [16:49:54] jhancock@cumin1003 provision (PID 3330135) is awaiting input [16:50:33] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-06-19-122231-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164273 [16:50:40] (03PS2) 10Btullis: Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) [16:52:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f8 [16:52:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.f9 [16:52:48] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:55:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:55:38] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:55:51] !log 
eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [16:56:58] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet [16:57:03] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6089/c" [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [16:57:31] !log repooled cp7006 [16:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:31] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951538 (10Andrew) [16:59:48] (03PS1) 10Fabfur: cache,haproxy: use http-after-response capture for x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) [16:59:52] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-06-19-122231-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164273 (owner: 10BryanDavis) [17:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1700). [17:00:05] swfrench-wmf: #bothumor I ♥ Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1700). [17:00:15] o/ [17:00:48] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164236 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:00:50] (03CR) 10Scott French: [C:03+2] deployment_server: use bookworm httpd in all mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/1164236 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:01:32] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-06-19-122231-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164273 (owner: 10BryanDavis) [17:02:02] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [17:03:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [17:03:47] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10951556 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2005.codfw.wmnet with OS bookworm [17:04:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:04:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [17:05:26] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:05:37] (03CR) 10CDanis: [C:03+1] cache,haproxy: use http-after-response capture for x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1164275 (https://phabricator.wikimedia.org/T397917) (owner: 10Fabfur) [17:05:41] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:05:51] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:06:08] !log bd808@deploy1003 helmfile 
[codfw] DONE helmfile.d/services/developer-portal: apply [17:06:17] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:06:48] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:07:15] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.f9 [17:07:18] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.fa [17:08:31] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:22] !log swfrench@deploy1003 Started scap sync-world: Migrate all mediawiki releases to bookworm httpd images - T378128 [17:09:28] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:09:36] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [17:10:19] !log swfrench@deploy1003 swfrench: Migrate all mediawiki releases to bookworm httpd images - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:12:21] (03PS1) 10Sbisson: SX: Disable autoAddToCatchall on navigation tools [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164276 [17:12:45] jhancock@cumin1003 reimage (PID 3332910) is awaiting input [17:12:58] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [17:16:41] !log swfrench@deploy1003 swfrench: Continuing with sync [17:18:15] (03CR) 10Herron: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:20:43] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.fa [17:20:45] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.fb [17:21:14] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2005.codfw.wmnet with OS bullseye [17:21:27] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10951584 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2005.codfw... 
[17:21:58] !log swfrench@deploy1003 Finished scap sync-world: Migrate all mediawiki releases to bookworm httpd images - T378128 (duration: 13m 01s) [17:22:03] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:22:29] (03CR) 10Scott French: [C:03+2] Revert "mw-(api-ext|web): pilot 5% of traffic on new httpd images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164242 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:24:16] (03Merged) 10jenkins-bot: Revert "mw-(api-ext|web): pilot 5% of traffic on new httpd images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164242 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:27:13] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:27:21] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:27:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:52] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:28:00] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:28:14] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [17:28:31] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:28:37] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:29:08] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:29:13] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:29:35] (03PS1) 10Cwhite: 
logstash: allow ecs-formatted scap logs to bypass migration filter [puppet] - 10https://gerrit.wikimedia.org/r/1164280 (https://phabricator.wikimedia.org/T397967) [17:32:31] (03CR) 10Cwhite: [C:03+2] logstash: allow ecs-formatted scap logs to bypass migration filter [puppet] - 10https://gerrit.wikimedia.org/r/1164280 (https://phabricator.wikimedia.org/T397967) (owner: 10Cwhite) [17:33:14] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore2005.codfw.wmnet [17:33:32] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164281 [17:34:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:34:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.fb [17:34:49] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.fc [17:35:00] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10951645 (10phaultfinder) [17:35:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:37:29] jhancock@cumin2002 provision (PID 2337173) is awaiting input [17:39:52] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2005.codfw.wmnet [17:40:46] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore2005.codfw.wmnet [17:41:00] 06SRE, 10SRE-Access-Requests: 
Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10951665 (10KLevan) Hi all, this is now working. Thank you very much! [17:42:54] (03CR) 10Nik Gkountas: [C:03+1] SX: Disable autoAddToCatchall on navigation tools [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164276 (owner: 10Sbisson) [17:47:22] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2005.codfw.wmnet [17:47:30] jouncebot nowandnext [17:47:30] For the next 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1700) [17:47:30] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1700) [17:47:30] In 0 hour(s) and 12 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1800) [17:48:36] Hi, can I deploy an important fix for Content Translation before wmf.7 hits group 2? [17:49:05] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.fc [17:49:08] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.fd [17:51:33] !log dancy@deploy1003 Started scap sync-world: Testing T396166 [17:51:40] T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? - https://phabricator.wikimedia.org/T396166 [17:54:22] stephanebisson: Check with jeena [17:55:09] jeena, can I deploy an important fix for Content Translation before wmf.7 hits group 2? 
[17:55:19] !log bootstrapping Cassandra/sessionstore2005-a — T390514 [17:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:35] stephanebisson: My guess is that you're okay to deploy before Jeena. [17:57:48] Especially if you're fixing something [17:58:06] (03PS1) 10Andrew Bogott: Prepare cloudcephosd200[12]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164283 (https://phabricator.wikimedia.org/T397968) [17:58:08] (03PS1) 10Andrew Bogott: Prepare cloudcephosd2003-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164284 (https://phabricator.wikimedia.org/T397968) [17:58:09] (03PS1) 10Andrew Bogott: Remove puppet refs to cloudcephosd200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1164285 (https://phabricator.wikimedia.org/T397968) [17:58:40] sorry I just saw this [17:58:49] stephanebisson: you can go ahead [17:58:57] Thanks [17:59:12] thanks dancy ! [17:59:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164276 (owner: 10Sbisson) [17:59:48] !log dancy@deploy1003 Finished scap sync-world: Testing T396166 (duration: 08m 14s) [17:59:54] T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? 
- https://phabricator.wikimedia.org/T396166 [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T1800) [18:00:10] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd200[12]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164283 (https://phabricator.wikimedia.org/T397968) (owner: 10Andrew Bogott) [18:01:45] (03Merged) 10jenkins-bot: SX: Disable autoAddToCatchall on navigation tools [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1164276 (owner: 10Sbisson) [18:02:06] !log andrew@cumin1003 START - Cookbook sre.hosts.decommission for hosts cloudcephosd2001-dev.codfw.wmnet [18:02:11] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1164276|SX: Disable autoAddToCatchall on navigation tools]] [18:03:17] (03CR) 10Xcollazo: [C:03+1] Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [18:04:16] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1164276|SX: Disable autoAddToCatchall on navigation tools]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:05:31] (03CR) 10Dr0ptp4kt: [C:03+1] Ensure that master=yarn is the default spark configuration for users [puppet] - 10https://gerrit.wikimedia.org/r/1164272 (https://phabricator.wikimedia.org/T393181) (owner: 10Btullis) [18:05:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.fd [18:05:49] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.fe [18:06:25] !log sbisson@deploy1003 sbisson: Continuing with sync [18:06:53] !log andrew@cumin1003 START - Cookbook sre.dns.netbox [18:10:45] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1206:9290 - https://phabricator.wikimedia.org/T397978 (10phaultfinder) 03NEW [18:11:32] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951792 (10Andrew) I think I have removed cloudcontrol2004-dev.private.codfw.wikimedia.cloud and the associated IP from netbox so hopefu... [18:11:34] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366#10951794 (10Jhancock.wm) replaced drive in bay 1. 
coordinated on irc [18:11:41] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164276|SX: Disable autoAddToCatchall on navigation tools]] (duration: 09m 30s) [18:12:35] !log andrew@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [18:12:56] !log andrew@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [18:12:57] !log andrew@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:57] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cloudcephosd2001-dev.codfw.wmnet [18:12:59] jeena, I'm done. Thanks! [18:13:38] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Drop php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/1164286 (https://phabricator.wikimedia.org/T396166) [18:14:14] !log andrew@cumin1003 START - Cookbook sre.hosts.decommission for hosts cloudcephosd2002-dev.codfw.wmnet [18:15:51] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1032-1033].eqiad.wmnet [18:16:24] Is anyone around that can deploy a puppet patch for me? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1164286 [18:17:03] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1032-1033].eqiad.wmnet [18:17:48] (03PS1) 10Michael Große: Growth: enable new way of refreshing LinkRecommendations for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) [18:17:48] (03CR) 10Michael Große: [C:04-1] "Should not be merged before the work for T386867 has not concluded." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [18:17:56] (03Abandoned) 10Dbrant: Add 'wikipedia:' to list of recognized protocols. [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160802 (https://phabricator.wikimedia.org/T386004) (owner: 10Dbrant) [18:19:43] (03CR) 10Jasmine: [C:03+2] wikikube: decommission wikikube-worker103[23].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [18:20:08] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.fe [18:20:11] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ff [18:21:15] !log andrew@cumin1003 START - Cookbook sre.dns.netbox [18:21:47] Thanks stephanebisson [18:22:34] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951828 (10Andrew) [18:25:14] ACKNOWLEDGEMENT - MD RAID on logstash2035 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 12, Failed: 0, Spare: 2 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T397980 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering 
[18:25:23] RECOVERY - OpenSearch health check for shards on 9200 on logstash2035 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 721, active_shards: 1618, relocating_shards: 6, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [18:25:23] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:25:23] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397980 (10ops-monitoring-bot) 03NEW [18:26:05] !log andrew@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [18:26:08] jasmine@cumin1002 decommission (PID 960092) is awaiting input [18:26:38] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951842 (10Andrew) a:05Andrew→03None Some puppet refs remain, they will soon be removed in a batch when cloudcephosd... 
[18:26:43] !log andrew@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [18:26:43] !log andrew@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:44] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd2002-dev.codfw.wmnet [18:26:51] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951846 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1003 for hosts: `cloudcephosd200... [18:27:48] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164290 (https://phabricator.wikimedia.org/T392177) [18:27:49] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164290 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:27:58] !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1032-1033].eqiad.wmnet [18:28:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951857 (10Andrew) [x] @Jhancock.wm will connect the second ports for cloudcephosd200[56]-dev [x] @Andrew will move the workload to the new nodes (part... 
[18:28:52] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164290 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:34:49] (03PS2) 10Andrew Bogott: Prepare cloudcephosd2003-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164284 (https://phabricator.wikimedia.org/T397968) [18:34:49] (03PS2) 10Andrew Bogott: Remove puppet refs to cloudcephosd200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1164285 (https://phabricator.wikimedia.org/T397979) [18:34:50] (03PS1) 10Andrew Bogott: Remove puppet refs to cloudcephosd100[12].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1164291 (https://phabricator.wikimedia.org/T397968) [18:35:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ff [18:36:48] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.7 refs T392177 [18:36:53] T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177 [18:37:05] (03PS3) 10Andrew Bogott: Remove puppet refs to cloudcephosd2003 [puppet] - 10https://gerrit.wikimedia.org/r/1164285 (https://phabricator.wikimedia.org/T397979) [18:37:58] !log jasmine@cumin1002 START - Cookbook sre.dns.netbox [18:38:40] (03CR) 10Andrew Bogott: [C:03+2] Remove puppet refs to cloudcephosd100[12].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1164291 (https://phabricator.wikimedia.org/T397968) (owner: 10Andrew Bogott) [18:43:32] jasmine@cumin1002 decommission (PID 960092) is awaiting input [18:48:12] I am going to start a backport for https://phabricator.wikimedia.org/T388685 now [18:49:38] (03PS46) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [18:51:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1163704 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [18:51:59] (03Merged) 10jenkins-bot: Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [18:52:14] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1163704|Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" (T388685)]] [18:52:20] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [18:53:05] !log jasmine@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1032-1033].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [18:54:10] !log jhuneidi@deploy1003 joelyrookewmde, jhuneidi: Backport for [[gerrit:1163704|Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:54:20] !log jasmine@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1032-1033].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [18:54:20] !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:54:21] !log jasmine@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1032-1033].eqiad.wmnet [18:58:46] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [18:59:55] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983 (10phaultfinder) 03NEW [19:01:05] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161660 (owner: 10PipelineBot) [19:01:31] !log jhuneidi@deploy1003 joelyrookewmde, jhuneidi: Continuing with sync [19:07:26] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163704|Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" (T388685)]] (duration: 15m 12s) [19:07:32] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [19:13:39] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10951990 (10ayounsi) Looks like it's still failing, an upgrade is probably still needed. ` >>> r =spicerack.redfish('cirrussearch2079') Management Pass... 
[19:14:34] (03PS47) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [19:14:39] (03CR) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [19:15:35] (03PS48) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [19:17:57] (03PS1) 10Ssingh: prometheus: add dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) [19:18:17] !log joal@deploy1003 Started deploy [airflow-dags/analytics@c3ba96d]: Deploy artifacts for airflow-dags/main [19:18:58] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@c3ba96d]: Deploy artifacts for airflow-dags/main (duration: 00m 41s) [19:24:11] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [19:24:14] (03CR) 10Ssingh: "Sample output from a dns host:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [19:24:18] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10952008 (10Fabfur) 05Open→03Resolved a:03Fabfur Glad it worked! 
[19:24:59] (03PS6) 10JHathaway: reimage: add dhcp MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [19:26:00] (03CR) 10JHathaway: [C:03+2] reimage: add dhcp MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [19:26:09] (03CR) 10JHathaway: [V:03+2 C:03+2] reimage: add dhcp MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [19:28:06] (03CR) 10Ssingh: "For clarity: this approach ensures that we don't have to worry about which services are specific to which DNS host (which is indeed the ca" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [19:30:16] jhancock@cumin2002 provision (PID 2337173) is awaiting input [19:31:23] (03CR) 10JHathaway: [C:03+1] reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [19:32:56] (03PS3) 10Ayounsi: reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 [19:34:03] (03CR) 10VolkerE: [C:03+1] [BETA CLUSTER] Stop loading VueTest, we're dropping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164155 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [19:34:15] (03CR) 10VolkerE: [C:03+1] Drop ability to use VueTest on a wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [19:38:10] jhancock@cumin2002 provision (PID 2337173) is awaiting input [19:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.77% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:54] (03PS7) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [19:43:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:55:13] RECOVERY - MD RAID on logstash2035 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T2000). [20:00:05] joelyrookewmde and cmelo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:06:23] (03PS1) 10Andrew Bogott: Remove a couple of incorrect comments. The cinder service password is now used. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1164299 (https://phabricator.wikimedia.org/T273150) [20:06:25] (03PS1) 10Andrew Bogott: Add stand-in passwords for 'glance' service user. [labs/private] - 10https://gerrit.wikimedia.org/r/1164300 (https://phabricator.wikimedia.org/T273150) [20:06:27] (03PS1) 10Andrew Bogott: Add dummy ldap passwords for designate service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164301 (https://phabricator.wikimedia.org/T273150) [20:07:22] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Remove a couple of incorrect comments. The cinder service password is now used. [labs/private] - 10https://gerrit.wikimedia.org/r/1164299 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:07:33] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add stand-in passwords for 'glance' service user. [labs/private] - 10https://gerrit.wikimedia.org/r/1164300 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:07:46] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add dummy ldap passwords for designate service user [labs/private] - 10https://gerrit.wikimedia.org/r/1164301 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:11:51] (03PS6) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [20:13:30] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:19:11] (03PS49) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [20:21:04] (03PS1) 10Jcrespo: Revert "bacula: Create a temporary backup job for long term Archival" [puppet] - 10https://gerrit.wikimedia.org/r/1164302 [20:22:51] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366#10952095 (10colewhite) [20:22:52] 10ops-codfw, 
06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397980#10952093 (10colewhite) →14Duplicate dup:03T397366 [20:23:32] (03PS2) 10Jcrespo: Revert "bacula: Create a temporary backup job for long term Archival" [puppet] - 10https://gerrit.wikimedia.org/r/1164302 [20:23:35] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366#10952098 (10colewhite) 05Open→03Resolved a:03colewhite Cluster is recovering nicely. Thank you! [20:24:15] (03CR) 10Jcrespo: [C:04-2] "Pending the archival of restored files." [puppet] - 10https://gerrit.wikimedia.org/r/1164302 (owner: 10Jcrespo) [20:26:31] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [20:26:38] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm [20:31:56] (03PS50) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [20:34:12] (03PS1) 10Andrew Bogott: Openstack glance: switch from novaadmin to 'glance' service user [puppet] - 10https://gerrit.wikimedia.org/r/1164303 (https://phabricator.wikimedia.org/T273150) [20:35:12] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1003.eqiad.wmnet with OS bookworm [20:35:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164303 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [20:36:03] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm [20:37:10] \0/ [20:37:29] cmelo do you need a deployer? 
[20:38:12] yes, I need to deploy a patch [20:38:24] okay I'll do it now [20:38:36] thank you so much [20:43:27] cmelo: I don't see the change Daimona requested for 'wmgUseCampaignEvents': "'private' => false, // Disable on private Wikipedias where the DB schema doesn't exist" in the patchset and they haven't added a +1 so I feel a bit uncomfortable deploying the change [20:45:26] Yes, we needed to remove it because it was failing CI, and we needed to add some wikis manually and remove the 'private' setting; we did it together earlier today [20:47:04] Daimona: could you please add a +1 to the patch? [20:48:00] sorry for the delay cmelo [20:48:28] yes, no problem, I have just sent a message to him [20:48:48] (03CR) 10Daimona Eaytoy: [C:03+1] Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [20:48:53] thank you so much! [20:48:55] Done, apologies! [20:49:02] thank you! [20:49:12] 👍 Just need to make sure since I don't really know much about that stuff [20:49:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [20:49:57] (03CR) 10Scott French: [C:03+1] "Thanks for cleaning these up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1164286 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [20:50:11] (03CR) 10Scott French: [C:03+2] scap.cfg.erb: Drop php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/1164286 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [20:50:17] (03Merged) 10jenkins-bot: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [20:50:33] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1162967|Release the CampaignEvents extension to all Wikipedias (T396784)]] [20:50:39] T396784: Release the CampaignEvents extension to all remaining Wikipedias - https://phabricator.wikimedia.org/T396784 [20:52:17] jeena: Just FYI, there’s a new /private change that’ll deploy if you run a sync-world. Which should be fine as it’s related to a very targeted private mitigation update. [20:52:27] !log jhuneidi@deploy1003 cmelo, jhuneidi: Backport for [[gerrit:1162967|Release the CampaignEvents extension to all Wikipedias (T396784)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:52:30] oh okay thanks sbassett [20:52:47] cmelo: ready for any checks you need to do on mwdebug [20:53:08] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [20:55:09] (03CR) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [20:56:31] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [20:56:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10952153 (10Jclark-ctr) Confirmed: Service Request 212013802 [20:57:22] (03PS51) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250626T2100) [21:00:22] cmelo: Daimona can anyone confirm the changes? Should I continue sync? 
[21:00:59] I will test it now [21:01:06] great thank you [21:03:25] (03PS1) 10Eevans: sessionstore2006: reimage to JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1164305 (https://phabricator.wikimedia.org/T391544) [21:03:26] (03PS1) 10Eevans: sessionstore2006: setup JBOD-based data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1164306 (https://phabricator.wikimedia.org/T391544) [21:03:29] (03PS1) 10Eevans: sessionstore2006: preseed d-i for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1164307 (https://phabricator.wikimedia.org/T391544) [21:03:43] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [21:05:26] (03PS52) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [21:06:15] (03CR) 10Eevans: [C:03+2] sessionstore2006: reimage to JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1164305 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:08:04] !log decommissioning Cassandra/sessionstore2006-a — T390514 [21:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:31] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:03] I am losing my session quite often, and it just removes the messages, jeena, could you tell me if it is already there, not sure if you already said that sorry for asking it again [21:11:00] Sorry I'm not sure what you mean? Are you asking if the changes have been deployed? [21:11:13] yes [21:11:23] Right now they are on mwdebug, which you can check using the mwdebug extension on your browser [21:11:31] Have you done this before? 
[21:12:06] Once confirmed on mwdebug, the changes can be fully deployed [21:12:12] ok, thanks yes I am testing it on mwdebug now but did not see the change yet, let me try again [21:12:23] Oh okay, you might try a hard refresh [21:12:28] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm [21:12:46] Daimona can you test it too please, I am testing on https://de.wikipedia.org/ [21:13:31] FIRING: [2x] ProbeDown: Service sessionstore2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:54] Ok, I see it there now, thank you jeena [21:15:03] Yes sorry I'm still half afk [21:15:16] ok, no problem [21:15:20] Thank you all, continuing with sync now [21:15:27] thank you! [21:15:37] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [21:15:38] !log jhuneidi@deploy1003 cmelo, jhuneidi: Continuing with sync [21:15:49] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952192 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.c... 
[21:20:40] Quickly checked a few wikis, looks good [21:21:17] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162967|Release the CampaignEvents extension to all Wikipedias (T396784)]] (duration: 30m 43s) [21:21:24] T396784: Release the CampaignEvents extension to all remaining Wikipedias - https://phabricator.wikimedia.org/T396784 [21:21:37] same here, thanks jeena and Daimona [21:21:54] (03PS5) 10Scott French: P:etcd::tlsproxy: fix notify behavior for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) [21:22:01] (03PS3) 10Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1164298 (https://phabricator.wikimedia.org/T352245) [21:22:14] deployment has finished 👍 [21:22:39] Ok. I’d like to get some security patches deployed now if that wraps up the backport window and the Web Team doesn’t have anything. [21:25:05] backports are complete [21:27:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:18] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2006.codfw.wmnet with OS bullseye [21:29:28] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw... 
[21:29:48] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [21:29:58] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952209 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.c... [21:31:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952217 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [21:32:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952221 (10Jclark-ctr) 05Resolved→03Open accidentally resolved ticket instead of assigning to my self [21:33:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952225 (10Jclark-ctr) [21:35:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:36:32] (03CR) 10Eevans: [C:03+2] sessionstore2006: setup JBOD-based data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1164306 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:38:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:48] (03PS53) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [21:45:24] !log Deployed 
security mitigations for T389010 and T395468 (sync-world) [21:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:31] FIRING: [2x] ProbeDown: Service sessionstore2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:49:07] (03CR) 10Cathal Mooney: "d" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [21:49:33] RESOLVED: [2x] ProbeDown: Service sessionstore2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:52:05] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2006.codfw.wmnet with OS bullseye [21:52:17] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw... [21:52:36] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [21:52:53] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.c... 
[22:00:24] (03PS54) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [22:03:41] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@1e85992]: HOTFIX - Deploy artifacts for airflow-dags/analytics_test [22:05:02] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@1e85992]: HOTFIX - Deploy artifacts for airflow-dags/analytics_test (duration: 01m 21s) [22:06:50] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2006.codfw.wmnet with OS bullseye [22:07:08] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952331 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw... [22:07:23] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [22:07:37] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952332 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.c... 
[22:07:53] !log joal@deploy1003 Started deploy [airflow-dags/analytics@1e85992]: HOTFIX - Deploy artifacts for airflow-dags/analytics [22:08:18] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [22:08:30] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@1e85992]: HOTFIX - Deploy artifacts for airflow-dags/analytics (duration: 00m 37s) [22:12:07] (03PS55) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [22:16:10] (03CR) 10Ladsgroup: ">Is wikilove_log the only table that is missing from the catalog but still exists in one or more wikis? Does it exist in clouddbs only or " [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [22:26:52] (03Restored) 10Andrea Denisse: centrallog: Disable temporary rsyslog debug config file. [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [22:27:27] eevans@cumin1003 reimage (PID 3365012) is awaiting input [22:28:41] !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2006.codfw.wmnet with OS bullseye [22:28:54] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952360 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw... 
[22:32:05] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye [22:32:20] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2006.c... [22:34:11] (03PS56) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [22:35:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:37:30] (03PS57) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [22:38:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:36] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@6d2c335]: HOTFIX - Deploy artifacts for airflow-dags/analytics_test [22:39:57] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@6d2c335]: HOTFIX - Deploy artifacts for airflow-dags/analytics_test (duration: 00m 21s) [22:49:23] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage [22:52:51] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage [22:55:45] (03PS1) 10JHathaway: dhcpd: add pxe-client-id [puppet] - 10https://gerrit.wikimedia.org/r/1164315
[22:59:19] (03PS2) 10JHathaway: dhcpd: add pxe-client-id [puppet] - 10https://gerrit.wikimedia.org/r/1164315 [23:02:42] (03PS1) 10JHathaway: dhcp: add a uuid based dhcp config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [23:04:45] (03PS2) 10JHathaway: dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [23:04:54] (03PS1) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [23:11:34] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [23:14:16] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2006.codfw.wmnet with OS bullseye [23:14:32] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10952464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2006.codfw...
[23:14:37] (03CR) 10CI reject: [V:04-1] dhcp: add a UUID based DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [23:18:31] FIRING: [2x] ProbeDown: Service sessionstore2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:22:12] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore2006.codfw.wmnet [23:27:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:28:42] (03CR) 10Dwisehaupt: "All clear for this to move forward." [puppet] - 10https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868) (owner: 10Dwisehaupt) [23:28:52] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2006.codfw.wmnet [23:33:21] !log bootstrapping Cassandra/sessionstore2006-a — T390514 [23:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:22] (03CR) 10Tim Starling: [C:03+1] Remove title-case overrides for PHP 8.1 migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152295 (https://phabricator.wikimedia.org/T394556) (owner: 10Scott French) [23:38:31] FIRING: [2x] ProbeDown: Service sessionstore2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164319
[23:38:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164319 (owner: 10TrainBranchBot) [23:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:45:41] !log dwisehaupt@dns1004 START - running authdns-update [23:46:39] !log dwisehaupt@dns1004 END - running authdns-update [23:48:19] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#10952532 (10Dwisehaupt) [23:49:06] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: Decommission frack hosts: frban1001 - https://phabricator.wikimedia.org/T397869#10952535 (10Dwisehaupt) [23:51:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1164319 (owner: 10TrainBranchBot) [23:58:31] RESOLVED: ProbeDown: Service sessionstore2006-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2006-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown