[00:00:18] ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10660042 (RobH) I've now incremented up 1.8 to 1.10, skipping to 1.13 and up to 1.15 and then 1.16 (have to load 1.16 on the cumin host as well). It is getting around end of day here, so I may have to...
[00:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[00:04:29] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129957|block: Don't modify an autoblock when the user specifies an IP (T389452)]] (duration: 26m 05s)
[00:08:15] !log restart varnishkafka-all on A:cp-ulsfo
[00:08:17] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet
[00:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:20] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet
[00:10:59] (CR) BCornwall: [C:+2] upgrade cp3078 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129864 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[00:12:10] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3078.esams.wmnet} and A:cp
[00:15:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660050 (phaultfinder)
[00:17:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3078.esams.wmnet} and A:cp
[00:23:00] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: BIOS upgrades
[00:29:39] RESOLVED: SystemdUnitFailed:
httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:27] FIRING: [4x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:37:27] FIRING: [4x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129970
[00:38:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129970 (owner: TrainBranchBot)
[00:39:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660093 (phaultfinder)
[00:43:53] FIRING: DDoSDetected: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[00:44:39] FIRING: [4x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:47:27] FIRING: [6x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:49:39] FIRING: [6x] ProbeDown: Service restbase1043-b:9042 has failed probes
(tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:49:42] !incidents
[00:49:43] 5765 (ACKED) DDoSDetected sre (netflow3003:9100 esams)
[00:50:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1129970 (owner: TrainBranchBot)
[00:58:28] ops-esams, SRE, DC-Ops, Traffic: 40% packet loss on ESAMS - https://phabricator.wikimedia.org/T389575#10660127 (AlexisJazz)
[00:59:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660129 (phaultfinder)
[01:01:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:04:39] FIRING: [5x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:05:00] ops-esams, SRE, DC-Ops, Traffic: 40% packet loss on ESAMS - https://phabricator.wikimedia.org/T389575#10660132 (AlexisJazz) If I specifically ping ESAMS through my VPN, again packet loss: ` # ping text-lb.esams.wikimedia.org -c10 PING text-lb.esams.wikimedia.org (185.15.59.224) 56(84) bytes of da...
[01:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:06:42] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:07:04] (PS1) Ssingh: sites: esams, set prepend_as_out true [homer/public] - https://gerrit.wikimedia.org/r/1129974
[01:07:21] !log sukhe@cumin1002 START - Cookbook sre.network.cf
[01:07:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[01:07:27] FIRING: [5x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:55] (CR) Ssingh: [C:+2] sites: esams, set prepend_as_out true [homer/public] - https://gerrit.wikimedia.org/r/1129974 (owner: Ssingh)
[01:08:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129975
[01:08:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129975 (owner: TrainBranchBot)
[01:09:39] FIRING: [5x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:09:57] !log running homer
[01:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 -
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:12:27] FIRING: [6x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:14:39] FIRING: [4x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:32] ops-esams, SRE, DC-Ops, Traffic: 40% packet loss on ESAMS - https://phabricator.wikimedia.org/T389575#10660148 (Dylsss) I'm also having some pretty severe packet loss to esams. ` ping phabricator.wikimedia.org -c10 PING phabricator.wikimedia.org (185.15.59.224) 56(84) bytes of data. 64 bytes from...
[01:16:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:19:39] FIRING: [4x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:22:02] (PS1) Ssingh: Revert "sites: esams, set prepend_as_out true" [homer/public] - https://gerrit.wikimedia.org/r/1129977
[01:24:09] !log sukhe@cumin1002 START - Cookbook sre.network.cf
[01:24:09] !log sukhe@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[01:24:16] !log sukhe@cumin1002 START - Cookbook sre.network.cf
[01:24:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[01:24:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 -
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:24:51] (CR) Ssingh: [C:+2] Revert "sites: esams, set prepend_as_out true" [homer/public] - https://gerrit.wikimedia.org/r/1129977 (owner: Ssingh)
[01:27:27] FIRING: [5x] ProbeDown: Service restbase1043-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:38] !log running homer on cr*-esams
[01:27:39] ops-esams, SRE, DC-Ops, Traffic: 40% packet loss on ESAMS - https://phabricator.wikimedia.org/T389575#10660166 (BCornwall) Open→In progress p:Triage→Unbreak!
[01:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:10] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1129975 (owner: TrainBranchBot)
[01:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:38:25] (PS1) Ssingh: sites: add prepend for drmrs [homer/public] - https://gerrit.wikimedia.org/r/1129980
[01:38:53] RESOLVED: DDoSDetected: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[01:50:04] ops-esams, SRE, DC-Ops, Traffic: 40% packet loss on ESAMS - https://phabricator.wikimedia.org/T389575#10660196 (BCornwall) In progress→Resolved a:BCornwall Thank you for the report. We've looked into the issue and now the network is behaving properly again.
[02:04:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660219 (phaultfinder)
[02:14:17] !log fixing corrupted blocks by directly updating the database for T389452
[02:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660226 (phaultfinder)
[02:37:06] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp4047.ulsfo.wmnet with reason: BIOS upgrades
[03:04:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660240 (phaultfinder)
[03:18:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[03:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[03:25:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660244 (phaultfinder)
[03:39:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660253 (phaultfinder)
[03:59:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660285 (phaultfinder)
[04:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances -
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[04:14:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660322 (phaultfinder)
[04:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660324 (phaultfinder)
[04:44:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660330 (phaultfinder)
[05:04:46] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660363 (phaultfinder)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660382 (phaultfinder)
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0600)
[06:05:22] (PS2) Muehlenhoff: nginx: Remove prometheus.lua [puppet] - https://gerrit.wikimedia.org/r/1036672
[06:05:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036672 (owner: Muehlenhoff)
[06:19:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660460 (phaultfinder)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0700)
[07:02:27] !log installing vim security updates
[07:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:40] (PS1) Slyngshede: data.yaml Offboarding sharvaniharan [puppet] - https://gerrit.wikimedia.org/r/1130012
[07:03:19] (CR) CI reject: [V:-1] data.yaml Offboarding sharvaniharan [puppet] - https://gerrit.wikimedia.org/r/1130012 (owner: Slyngshede)
[07:04:08] (PS2) Slyngshede: data.yaml Offboarding sharvaniharan [puppet] - https://gerrit.wikimedia.org/r/1130012
[07:05:38] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1130012 (owner: Slyngshede)
[07:07:34] (CR) Slyngshede: [C:+2] data.yaml Offboarding sharvaniharan [puppet] - https://gerrit.wikimedia.org/r/1130012 (owner: Slyngshede)
[07:09:56] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Sharvaniharan out of all services on: 2292 hosts
[07:10:11] !log installing Linux 6.1.129 on Bookworm hosts
[07:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[07:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:19:32] (PS2) Phedenskog: grafana: Fix synthetic performance test JSON proxy endpoint.
[puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750)
[07:20:39] (CR) CI reject: [V:-1] grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[07:22:20] (PS3) Phedenskog: grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750)
[07:22:44] (CR) CI reject: [V:-1] grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[07:24:26] SRE-swift-storage, MediaWiki-libs-HTTP, MW-Interfaces-Team, Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10660603 (kostajh) >>! In T369186#10659728, @Tgr wrote...
[07:25:54] (CR) Ayounsi: [C:+1] netbox: refactor support for GraphQL queries [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[07:34:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660629 (phaultfinder)
[07:34:37] (PS4) Phedenskog: grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750)
[07:39:46] (CR) Phedenskog: "Thank you Filippo! I tried to follow your example, but now 100% sure I got it right. Please have a look when you have time!"
[puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[07:48:47] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10660644 (MoritzMuehlenhoff)
[07:54:25] !log krinkle@mwmaint: Fix actor_name encoding on cawiki for 1 row: actor_id=342864, per T389559
[07:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:29] T389559: Unable to view page information on [[Difracció]] article at ca.wikipedia.org - https://phabricator.wikimedia.org/T389559
[08:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[08:01:32] this is being worked on --^
[08:06:50] (PS1) Muehlenhoff: Assign puppetserver role to puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1130069 (https://phabricator.wikimedia.org/T381274)
[08:06:51] (PS1) Muehlenhoff: Assign puppetserver role to puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1130070 (https://phabricator.wikimedia.org/T381274)
[08:22:46] (PS1) Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - https://gerrit.wikimedia.org/r/1130071
[08:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660755 (phaultfinder)
[08:26:02] (PS2) Muehlenhoff: Configure puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1130070 (https://phabricator.wikimedia.org/T381274)
[08:26:05] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5121/co" [puppet] - https://gerrit.wikimedia.org/r/1130071 (owner: Slyngshede)
[08:26:29] (PS25) Fabfur: haproxy: using tmpfs directory for private tls
material [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227)
[08:27:24] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bookworm
[08:27:39] (CR) Fabfur: [C:-2] "do not merge until 24/03" [puppet] - https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[08:29:19] (CR) Elukey: [C:+1] Assign puppetserver role to puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1130069 (https://phabricator.wikimedia.org/T381274) (owner: Muehlenhoff)
[08:30:03] (CR) Elukey: [C:+1] Configure puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1130070 (https://phabricator.wikimedia.org/T381274) (owner: Muehlenhoff)
[08:37:01] (PS1) Muehlenhoff: Add service record for puppetserver2004 [dns] - https://gerrit.wikimedia.org/r/1130073 (https://phabricator.wikimedia.org/T381274)
[08:55:05] (CR) Elukey: [C:+1] Add service record for puppetserver2004 [dns] - https://gerrit.wikimedia.org/r/1130073 (https://phabricator.wikimedia.org/T381274) (owner: Muehlenhoff)
[09:00:11] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage
[09:02:31] SRE, Traffic, WikimediaDebug, Developer Productivity, Patch-For-Review: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794#10660809 (Krinkle)
[09:02:51] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove .m.
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10660810 (Krinkle)
[09:03:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage
[09:04:24] (PS1) Elukey: role::ml_k8s::worker: move ml-serve2003 to containerd [puppet] - https://gerrit.wikimedia.org/r/1130075 (https://phabricator.wikimedia.org/T387854)
[09:04:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10660829 (phaultfinder)
[09:05:17] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1128399 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[09:06:08] (CR) Muehlenhoff: [C:+2] Bitu: Add obsolete test config [puppet] - https://gerrit.wikimedia.org/r/1128358 (owner: Muehlenhoff)
[09:08:19] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5122/" [puppet] - https://gerrit.wikimedia.org/r/1130075 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[09:18:28] (CR) Alexandros Kosiaris: [C:+1] "Agreed. However, since both groups are still in the "proposed" state, this can go forward like it is now and we can alter the groups later" [puppet] - https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: Muehlenhoff)
[09:25:37] (PS5) Filippo Giunchedi: grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:26:00] (CR) CI reject: [V:-1] grafana: Fix synthetic performance test JSON proxy endpoint.
[puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:27:55] !log imported python3-flask-sqlalchemy 2.1-4 to main component of wikimedia-bullseye (imported from bullseye-backports which will be archived soon) T383557
[09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:59] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[09:28:36] (CR) Filippo Giunchedi: "Last PS should work as expected! note that the change needs to happen to httpd class not httpd::site. I've also added the ssl proxy direct" [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:29:27] (PS6) Filippo Giunchedi: grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:34:40] (CR) Filippo Giunchedi: [C:+2] logstash: read k8s-mw topics as needed [puppet] - https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: Filippo Giunchedi)
[09:35:04] (PS1) Muehlenhoff: dynamicproxy::api: Install python3-flask-sqlalchemy from "main" component [puppet] - https://gerrit.wikimedia.org/r/1130078 (https://phabricator.wikimedia.org/T383557)
[09:35:39] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2002.codfw.wmnet with OS bookworm
[09:35:59] (PS1) Brouberol: airflow-test-k8s: restore the initial instance settings [deployment-charts] - https://gerrit.wikimedia.org/r/1130079 (https://phabricator.wikimedia.org/T386282)
[09:36:01] (PS1) Brouberol: airflow-main: drop the migration info message from the UI [deployment-charts] - https://gerrit.wikimedia.org/r/1130080 (https://phabricator.wikimedia.org/T386282)
[09:37:05] (PS2) Slyngshede: P:ldaptui LDAP Terminal UI
[puppet] - https://gerrit.wikimedia.org/r/1130071
[09:39:14] (CR) CI reject: [V:-1] P:ldaptui LDAP Terminal UI [puppet] - https://gerrit.wikimedia.org/r/1130071 (owner: Slyngshede)
[09:42:10] (CR) Btullis: [C:+1] airflow-test-k8s: restore the initial instance settings [deployment-charts] - https://gerrit.wikimedia.org/r/1130079 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:42:15] (CR) Tiziano Fogli: [C:+1] grafana: Fix synthetic performance test JSON proxy endpoint. [puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:42:20] (CR) Btullis: [C:+1] airflow-main: drop the migration info message from the UI [deployment-charts] - https://gerrit.wikimedia.org/r/1130080 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:46:32] (CR) Brouberol: [C:+2] airflow-test-k8s: restore the initial instance settings [deployment-charts] - https://gerrit.wikimedia.org/r/1130079 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:46:36] (CR) Brouberol: [C:+2] airflow-main: drop the migration info message from the UI [deployment-charts] - https://gerrit.wikimedia.org/r/1130080 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:47:23] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10660949 (elukey) Worked perfectly, thanks a lot!
[09:47:25] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10660950 (elukey) Open→Resolved
[09:47:59] (Merged) jenkins-bot: airflow-test-k8s: restore the initial instance settings [deployment-charts] - https://gerrit.wikimedia.org/r/1130079 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:48:03] (Merged) jenkins-bot: airflow-main: drop the migration info message from the UI [deployment-charts] - https://gerrit.wikimedia.org/r/1130080 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:48:12] (PS1) Muehlenhoff: Stop including bullseye-backports on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1130082 (https://phabricator.wikimedia.org/T383557)
[09:48:38] (PS1) Elukey: Revert "maps: fix id type for the table wikidata_relation_members in imposm_mapping" [puppet] - https://gerrit.wikimedia.org/r/1130083
[09:50:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:50:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:50:58] (PS1) Muehlenhoff: apt::package_from_bpo: Fail if used on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557)
[09:51:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[09:51:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[09:52:36] (CR) Filippo Giunchedi: [C:+2] grafana: Fix synthetic performance test JSON proxy endpoint.
[puppet] - https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: Phedenskog)
[09:52:50] (CR) CI reject: [V:-1] apt::package_from_bpo: Fail if used on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[09:53:55] (CR) Alexandros Kosiaris: [C:+1] servicecatalog: add codesearch in state service_setup [puppet] - https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[09:54:25] (CR) Majavah: [C:+1] dynamicproxy::api: Install python3-flask-sqlalchemy from "main" component [puppet] - https://gerrit.wikimedia.org/r/1130078 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[09:56:25] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10660991 (Gehel)
[09:56:45] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10661003 (Gehel)
[09:57:45] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10661033 (Gehel)
[09:57:52] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10661037 (Gehel)
[09:58:49] (CR) Jgiannelos: [C:+1] Revert "maps: fix id type for the table wikidata_relation_members in imposm_mapping" [puppet] - https://gerrit.wikimedia.org/r/1130083 (owner: Elukey)
[10:00:13] sre-alert-triage, Data-Platform-SRE (2025.03.22 - 2025.04.11): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10661105
(Gehel)
[10:01:02] ops-codfw, ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10661121 (Gehel)
[10:03:11] (PS3) Gkyziridis: inference-services: edit-check GPU version deployment on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100)
[10:03:37] (CR) Gkyziridis: inference-services: edit-check GPU version deployment on staging. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[10:03:51] (CR) Ladsgroup: "Awesome. Thanks!" [dumps] - https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[10:03:55] (PS1) Brouberol: airflow-test-k8s: restore analytics-test values [deployment-charts] - https://gerrit.wikimedia.org/r/1130088 (https://phabricator.wikimedia.org/T386282)
[10:04:19] (PS1) Ilias Sarantopoulos: ml-services: udpate ml-staging ref-need deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019)
[10:04:23] (PS2) Ilias Sarantopoulos: ml-services: udpate ml-staging ref-need deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019)
[10:04:27] (PS2) Muehlenhoff: apt::package_from_bpo: Fail if used on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557)
[10:06:12] (CR) CI reject: [V:-1] apt::package_from_bpo: Fail if used on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[10:06:14] !log `alter sequence wikidata_relation_members_id_seq as bigint;` on maps1009's gis database - T389462
[10:06:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:18] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [10:11:02] (03CR) 10Ilias Sarantopoulos: inference-services: edit-check GPU version deployment on staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [10:12:28] (03PS3) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [10:13:16] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: restore analytics-test values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130088 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:13:20] (03CR) 10Elukey: [C:03+2] Revert "maps: fix id type for the table wikidata_relation_members in imposm_mapping" [puppet] - 10https://gerrit.wikimedia.org/r/1130083 (owner: 10Elukey) [10:13:45] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: restore analytics-test values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130088 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:14:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:14:38] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [10:14:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:15:23] (03PS4) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [10:15:50] !log ALTER TABLE public.wikidata_relation_members ALTER COLUMN id TYPE bigint; on maps2009's posgres - T389462 [10:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:54] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [10:16:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:16:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:17:32] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [10:17:55] (03PS5) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [10:19:05] (03PS1) 10Brouberol: mediawiki-dumps-legacy: dumps DAGs are now going to run on airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130091 (https://phabricator.wikimedia.org/T386282) [10:20:03] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [10:20:05] (03PS3) 10Muehlenhoff: apt::package_from_bpo: Fail if used on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) [10:20:43] (03CR) 10CI reject: [V:04-1] apt::package_from_bpo: Fail if used on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [10:23:37] !log alter sequence wikidata_relation_members_id_seq as bigint; on maps2009's gis database - T389462 [10:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:48] T389462: OSM replication lag on maps1009 - https://phabricator.wikimedia.org/T389462 [10:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661227 (10phaultfinder) [10:24:51] (03PS4) 10Gkyziridis: ml-services: edit-check GPU version deployment on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) [10:27:15] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: dumps DAGs are now going to run on airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130091 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:28:10] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: dumps DAGs are now going to run on airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130091 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:30:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:30:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:30:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:30:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:34:48] (03PS5) 10Gkyziridis: ml-services: edit-check GPU version experimental ns deployment on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) [10:40:03] (03PS4) 10Muehlenhoff: apt::package_from_bpo: Fail if used on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) [10:40:26] (03PS1) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) [10:40:58] (03PS2) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) [10:41:55] (03PS3) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) [10:41:55] (03CR) 10CI reject: [V:04-1] apt::package_from_bpo: Fail if used on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [10:45:38] 06SRE, 06serviceops, 10Wikidata, 10Wikidata Integration in Wikimedia projects, 10Wikimedia-Site-requests: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10661306 (10seanleong-WMDE) Hii, got it! We will create a patch to bump the limit for other wikis to 50... 
[10:46:20] (03PS1) 10Cathal Mooney: Add prepend-as-out variable for each site always [homer/public] - 10https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) [10:48:44] (03PS1) 10Aklapper: Remove wikimedia.org/resources redirect for Wikimedia Resource Center [puppet] - 10https://gerrit.wikimedia.org/r/1130096 (https://phabricator.wikimedia.org/T307965) [10:49:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:49:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:52:28] (03PS1) 10Ilias Sarantopoulos: api-gateway: allow anonymous requests to edit-check on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130098 (https://phabricator.wikimedia.org/T388269) [10:56:06] 07sre-alert-triage, 06serviceops: Alert in need of triage: Postgres Replication Lag (instance maps-test2002) - https://phabricator.wikimedia.org/T388782#10661327 (10LSobanski) There are three overdue alerts for maps-test, two of which are critical. Can these be disabled or downgraded? [10:59:03] (03PS1) 10Muehlenhoff: maps: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130099 [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T0700) [11:00:05] jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250321T1100). [11:02:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130099 (owner: 10Muehlenhoff) [11:02:34] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: edit-check GPU version experimental ns deployment on staging. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:04:21] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::worker: move ml-serve2003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130075 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [11:07:15] (03CR) 10Elukey: [C:03+1] maps: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130099 (owner: 10Muehlenhoff) [11:09:20] (03CR) 10Alexandros Kosiaris: [C:04-1] "I am not liking this. It feels wrong to add in this group spiderpig, when it's supposed to be doing the same thing as well here, which is " [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [11:10:35] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2003.codfw.wmnet with OS bookworm [11:11:00] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2003 [11:11:07] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [11:14:11] (03PS6) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:16:24] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:16:39] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2003 - elukey@cumin1002" [11:16:40] (03PS1) 10Muehlenhoff: maps/bookworm: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130100 (https://phabricator.wikimedia.org/T381565) [11:16:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2003 - elukey@cumin1002" [11:16:45] !log 
elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:45] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2003.codfw.wmnet 29.32.192.10.in-addr.arpa 9.2.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:16:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2003.codfw.wmnet 29.32.192.10.in-addr.arpa 9.2.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:16:49] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2003 [11:17:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2003 [11:17:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2003 [11:18:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [11:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:19:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661388 (10phaultfinder) [11:19:40] (03PS7) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:19:43] (03PS1) 10Albertoleoncio: Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 [11:20:25] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5130/co" [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:20:55] (03PS2) 10Albertoleoncio: Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 (https://phabricator.wikimedia.org/T389609) [11:21:50] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:21:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130100 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:24:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 (https://phabricator.wikimedia.org/T389609) (owner: 10Albertoleoncio) [11:24:07] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10661391 (10cmooney) >>! In T381389#10583616, @xcollazo wrote: > @cmooney, should we move forward with this pat... 
[11:24:39] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:18] (03CR) 10Elukey: [C:03+1] maps/bookworm: Cleanup confusing Hiera settings for postgresql replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130100 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:27:27] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:21] (03CR) 10Muehlenhoff: [C:03+2] CAS: Add service definition for spiderpig [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [11:32:39] (03PS8) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:34:51] (03CR) 10CI reject: [V:04-1] P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:41:47] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [11:41:57] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5131/co" [puppet] - 10https://gerrit.wikimedia.org/r/1130071 (owner: 10Slyngshede) [11:42:07] 07sre-alert-triage, 06serviceops: Alert in need of triage: Postgres Replication Lag (instance maps-test2002) - https://phabricator.wikimedia.org/T388782#10661429 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert I think 
@elukey fixed those, they appear green in icinga now. Feel free to reopen... [11:43:18] 07sre-alert-triage, 06serviceops: Alert in need of triage: Postgres Replication Lag (instance maps-test2002) - https://phabricator.wikimedia.org/T388782#10661433 (10MoritzMuehlenhoff) You're wrong, but it's still fine to close :-) Luca fixed the replication lag for the main maps cluster, this is for the W... [11:45:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [11:46:18] (03PS1) 10Muehlenhoff: Disable notifications for maps/bookworm during rampup phase [puppet] - 10https://gerrit.wikimedia.org/r/1130103 (https://phabricator.wikimedia.org/T388782) [11:49:18] (03PS9) 10Slyngshede: P:ldaptui LDAP Terminal UI [puppet] - 10https://gerrit.wikimedia.org/r/1130071 [11:50:22] (03CR) 10Elukey: [C:03+1] Disable notifications for maps/bookworm during rampup phase [puppet] - 10https://gerrit.wikimedia.org/r/1130103 (https://phabricator.wikimedia.org/T388782) (owner: 10Muehlenhoff) [11:51:20] (03CR) 10Klausman: [C:03+1] api-gateway: allow anonymous requests to edit-check on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130098 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [11:52:35] (03CR) 10Ilias Sarantopoulos: [C:03+2] api-gateway: allow anonymous requests to edit-check on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130098 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [11:54:08] (03Merged) 10jenkins-bot: api-gateway: allow anonymous requests to edit-check on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130098 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [11:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661470 (10phaultfinder) [12:01:07] FIRING: OsmSynchronisationLag: Maps - OSM 
synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:02:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2003.codfw.wmnet with OS bookworm [12:05:02] added a silence for the osm sync lag, maps1009 is catching up and it will take a bit [12:05:46] (03PS3) 10Phuedx: ext-EventStreamConfig: Reduce product_metrics.web_base data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 [12:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661502 (10phaultfinder) [12:20:18] !log mlitn@deploy1003 Started deploy [airflow-dags/platform_eng@317134a]: (no justification provided) [12:20:46] !log mlitn@deploy1003 Finished deploy [airflow-dags/platform_eng@317134a]: (no justification provided) (duration: 00m 30s) [12:25:17] (03CR) 10Muehlenhoff: [C:03+2] Disable notifications for maps/bookworm during rampup phase [puppet] - 10https://gerrit.wikimedia.org/r/1130103 (https://phabricator.wikimedia.org/T388782) (owner: 10Muehlenhoff) [12:28:57] (03PS1) 10Muehlenhoff: Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1130104 [12:30:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1116.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:31:33] (03CR) 10Dreamy Jazz: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [12:34:23] (03CR) 10Dreamy Jazz: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [12:35:54] (03CR) 10Kamila Součková: [C:03+1] modules.cache.mcrouter: Copy 
for new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129802 (https://phabricator.wikimedia.org/T389480) (owner: 10Clément Goubert) [12:38:47] (03PS1) 10Muehlenhoff: osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) [12:38:53] (03CR) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [12:41:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:41:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1116.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:42:21] (03PS1) 10Federico Ceratto: Check ActionResult during depooling, extract dbctl_conf [cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) [12:42:47] !log vacuum systemd journal logs down to 500M on registry200[4-5].codfw.wmnet [12:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1116.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661681 (10phaultfinder) [12:46:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661683 (10VRiley-WMF) [12:48:59] (03CR) 10CI reject: [V:04-1] Check ActionResult during depooling, extract dbctl_conf [cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [12:49:41] 10ops-eqiad, 06SRE, 
06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661692 (10phaultfinder) [12:50:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1116.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:50:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1115.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:51:14] (03PS2) 10Federico Ceratto: Check ActionResult during depooling, extract dbctl_conf [cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) [12:52:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1115.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:52:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1114.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:53:51] (03CR) 10Federico Ceratto: "A small fix and cleanup. Tested in dry-run mode only." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1130107 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [12:54:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661722 (10phaultfinder) [12:58:28] (03PS2) 10Muehlenhoff: osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) [12:59:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1114.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:59:47] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:06:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:09:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:15:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:16:19] (03Abandoned) 10Ssingh: sites: add prepend for drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1129980 (owner: 10Ssingh) [13:16:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:16:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:17:57] (03PS1) 10Bking: data-platform: Detune RdfStreamingUpdaterSpaceUsageTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1130114 (https://phabricator.wikimedia.org/T387920) [13:19:37] (03CR) 10CI reject: [V:04-1] data-platform: Detune RdfStreamingUpdaterSpaceUsageTooHigh [alerts] - 
10https://gerrit.wikimedia.org/r/1130114 (https://phabricator.wikimedia.org/T387920) (owner: 10Bking) [13:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661834 (10phaultfinder) [13:23:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:23:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1117.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:31:18] (03PS3) 10Muehlenhoff: osm: Handle new requirements for Postgres replication slots [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) [13:31:52] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1111.eqiad.wmnet with OS bullseye [13:32:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10661872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1111.eqiad.wmnet with O... [13:32:05] (03PS3) 10Ilias Sarantopoulos: ml-services: udpate ml-staging ref-need deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) [13:34:44] (03PS2) 10Bking: data-platform: Detune RdfStreamingUpdaterSpaceUsageTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1130114 (https://phabricator.wikimedia.org/T387920) [13:35:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1117.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:36:28] (03CR) 10Bearloga: [C:03+1] "Thanks for updating!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [13:36:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1117.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:42:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661893 (10VRiley-WMF) [13:42:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1111.eqiad.wmnet with reason: host reimage [13:42:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661899 (10VRiley-WMF) @klausman This has been completed and the drives have been added. Is there anything additional we may need to do on our end? [13:43:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1117.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:44:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:44:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1118.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:45:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1111.eqiad.wmnet with reason: host reimage [13:46:12] (03CR) 10Brouberol: [C:03+1] data-platform: Detune RdfStreamingUpdaterSpaceUsageTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1130114 (https://phabricator.wikimedia.org/T387920) (owner: 10Bking) [13:47:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661916 (10klausman) 05Open→03Resolved a:03klausman >>! 
In T381394#10661893, @VRiley-WMF wrote: > @klausman This has been completed and the driv... [13:47:56] (03CR) 10Bking: [C:03+2] data-platform: Detune RdfStreamingUpdaterSpaceUsageTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1130114 (https://phabricator.wikimedia.org/T387920) (owner: 10Bking) [13:48:38] (03PS1) 10Gergő Tisza: Enable SUL3 login for all group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130121 (https://phabricator.wikimedia.org/T384153) [13:48:40] (03PS1) 10Gergő Tisza: Enable SUL3 login for 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) [13:49:04] !log bootstrapping restbase1043-c/cassandra — T389423 [13:49:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130121 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [13:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:07] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [13:49:16] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [13:49:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:49:28] (03CR) 10CI reject: [V:04-1] Enable SUL3 login for 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:50:04] (03CR) 10Klausman: [C:03+1] ml-services: udpate ml-staging ref-need deployment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:50:11] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [13:53:27] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [13:53:38] !log sukhe@cumin1002 START - Cookbook sre.network.cf [13:53:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [13:54:10] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [13:55:07] (03PS1) 10Federico Ceratto: clone.py: switch to using pool/depool cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1127022 (https://phabricator.wikimedia.org/T388383) [13:55:07] (03CR) 10Federico Ceratto: "Ready for CR" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127022 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [13:55:58] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Fix spiderpig_auth_server url [puppet] - 10https://gerrit.wikimedia.org/r/1130125 (https://phabricator.wikimedia.org/T383947) [13:56:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1118.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:56:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1118.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:57:32] (03PS4) 10Ilias Sarantopoulos: ml-services: enable multiprocessing for reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) [13:58:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1130125 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [13:58:24] (03CR) 10Muehlenhoff: [C:03+2] scap.cfg.erb: Fix spiderpig_auth_server url [puppet] - 10https://gerrit.wikimedia.org/r/1130125 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) 
[13:58:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:58:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:58:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1111.eqiad.wmnet with OS bullseye [13:59:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10661975 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1111.eqiad.wmnet with OS bu... [13:59:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1112.eqiad.wmnet with OS bullseye [13:59:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10661985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1112.eqiad.wmnet with O... 
[13:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10661986 (10phaultfinder) [13:59:54] !log T389589 Ran mwscript-k8s --comment="T389589" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=zhwiki --logwiki=metawiki 'Pinnasalvatore80' 'Diana 79' [13:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] T389589: Unblock stuck global renames - https://phabricator.wikimedia.org/T389589 [14:00:43] !log T389589 Ran mwscript-k8s --comment="T389589" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bewiki --logwiki=metawiki 'Daanschr' 'Daan Schrama' [14:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:02:09] (03CR) 10Ssingh: [C:03+1] "Leaving to the netops expertise but it's a good idea." [homer/public] - 10https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [14:02:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1113.eqiad.wmnet with OS bullseye [14:02:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1113.eqiad.wmnet with O... 
[14:03:27] (03PS1) 10Santiago Faci: [Experiment Platform] Disable test experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130128 (https://phabricator.wikimedia.org/T383801) [14:03:41] (03CR) 10Klausman: ml-services: enable multiprocessing for reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:03:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1118.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:04:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1120.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:05:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662018 (10phaultfinder) [14:07:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:07:51] (03CR) 10Phuedx: [C:03+1] [Experiment Platform] Disable test experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130128 (https://phabricator.wikimedia.org/T383801) (owner: 10Santiago Faci) [14:10:04] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1112.eqiad.wmnet with reason: host reimage [14:10:28] (03CR) 10Elukey: [C:03+1] Stop including bullseye-backports on Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/1130082 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:12:44] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1113.eqiad.wmnet with reason: host reimage [14:13:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:13:41] (03PS5) 10Ilias Sarantopoulos: ml-services: enable multiprocessing for reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) [14:13:43] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1112.eqiad.wmnet with reason: host reimage [14:15:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:16:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:16:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1120.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:16:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1113.eqiad.wmnet with reason: host reimage [14:17:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:17:33] (03CR) 10Gkyziridis: [C:03+1] "Thnx Ilias. LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:18:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:18:12] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T389538#10662060 (10MatthewVernon) @Yann looking at the log, you were able to delete this file (at about 19:28 on 2025-03-20), and it's subsequent... 
[14:18:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:18:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1120.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:19:47] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable multiprocessing for reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:20:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2251 to codfw - jhancock@cumin2002" [14:20:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2251 to codfw - jhancock@cumin2002" [14:20:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:11] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2004 [14:21:15] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2251 [14:21:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2252 [14:21:20] (03Merged) 10jenkins-bot: ml-services: enable multiprocessing for reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130089 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:21:21] (03CR) 10Elukey: "I have two questions:" [puppet] - 10https://gerrit.wikimedia.org/r/1130106 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:21:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2004 [14:21:26] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2251 [14:21:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2252 [14:22:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:22:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:22:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:22:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:23:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:23:20] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130131 (https://phabricator.wikimedia.org/T387854) [14:23:25] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:23:43] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: edit-check GPU version experimental ns deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:24:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1120.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:25:32] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:25:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662087 (10phaultfinder) [14:27:07] (03PS6) 10Gkyziridis: ml-services: edit-check GPU version experimental ns deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) [14:27:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:27:36] (03PS4) 10Alexandros Kosiaris: profile::mediawiki::system_users: Create spiderpig user [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [14:27:50] (03CR) 10Ilias Sarantopoulos: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:28:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:28:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1112.eqiad.wmnet with OS bullseye [14:29:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1112.eqiad.wmnet with OS bu... [14:29:13] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [14:29:22] (03CR) 10Alexandros Kosiaris: [C:03+2] "I 've discussed this with Ahmon today and I had made a wrong assumption here. 
This is indeed the only way out for now, at least in a way t" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [14:29:28] (03Merged) 10jenkins-bot: ml-services: edit-check GPU version experimental ns deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:29:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1114.eqiad.wmnet with OS bullseye [14:29:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1114.eqiad.wmnet with O... [14:30:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:30:40] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:30:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:30:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1113.eqiad.wmnet with OS bullseye [14:30:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1113.eqiad.wmnet with OS bu... [14:31:29] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[14:31:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1115.eqiad.wmnet with OS bullseye [14:31:36] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [14:31:42] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [14:31:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1115.eqiad.wmnet with O... [14:32:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:34:25] (03PS1) 10Ssingh: sre.network.cf: log if no changes were made [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 [14:35:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:35:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:36:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:36:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [14:36:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:38:08] (03PS2) 10Ssingh: sre.network.cf: log if no changes were made [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 [14:38:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:38:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:38:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:39:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:39:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1114.eqiad.wmnet with reason: host reimage [14:41:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:41:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:41:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:41:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:42:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1115.eqiad.wmnet with reason: host reimage [14:43:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2004.codfw.wmnet with OS bookworm [14:43:11] 
!log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2249.codfw.wmnet with OS bookworm [14:43:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2250.codfw.wmnet with OS bookworm [14:43:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-ctrl2004.codfw.wmnet with O... [14:43:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2251.codfw.wmnet with OS bookworm [14:43:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2252.codfw.wmnet with OS bookworm [14:43:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662187 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2249.codfw.wmnet with... [14:43:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662188 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2250.codfw.wmnet with... [14:43:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662189 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2251.codfw.wmnet with... 
[14:43:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662190 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2252.codfw.wmnet with... [14:43:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1114.eqiad.wmnet with reason: host reimage [14:43:44] (03CR) 10Slyngshede: P:idm add logstash to requestable permission (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114949 (owner: 10Slyngshede) [14:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662191 (10phaultfinder) [14:46:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1115.eqiad.wmnet with reason: host reimage [14:46:40] (03PS1) 10Ssingh: P:dns::auth: explicitly log to SAL if authdns-update run failed [puppet] - 10https://gerrit.wikimedia.org/r/1130136 [14:47:09] (03PS1) 10Aqu: Analytics: Depecate wmf.webrequest data purge [puppet] - 10https://gerrit.wikimedia.org/r/1130137 (https://phabricator.wikimedia.org/T387750) [14:47:28] FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:45] (03PS1) 10Ahmon Dancy: data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) [14:47:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1116.eqiad.wmnet with OS bullseye [14:47:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - 
https://phabricator.wikimedia.org/T384966#10662206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1116.eqiad.wmnet with O... [14:48:32] (03PS2) 10Ahmon Dancy: data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) [14:48:36] (03CR) 10Fabfur: [C:03+1] P:dns::auth: explicitly log to SAL if authdns-update run failed [puppet] - 10https://gerrit.wikimedia.org/r/1130136 (owner: 10Ssingh) [14:48:48] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:49:23] (03PS2) 10Aqu: WIP Analytics: Depecate wmf.webrequest data purge [puppet] - 10https://gerrit.wikimedia.org/r/1130137 (https://phabricator.wikimedia.org/T387750) [14:51:34] (03CR) 10CI reject: [V:04-1] WIP Analytics: Depecate wmf.webrequest data purge [puppet] - 10https://gerrit.wikimedia.org/r/1130137 (https://phabricator.wikimedia.org/T387750) (owner: 10Aqu) [14:51:34] (03CR) 10Ssingh: [C:03+2] P:dns::auth: explicitly log to SAL if authdns-update run failed [puppet] - 10https://gerrit.wikimedia.org/r/1130136 (owner: 10Ssingh) [14:52:42] (03PS5) 10Alexandros Kosiaris: profile::mediawiki::system_users: Create spiderpig user [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [14:52:55] !log sudo cumin 'A:dnsbox' 'run-puppet-agent' [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:54:11] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve2004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130131 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [14:54:44] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker2250.codfw.wmnet with reason: host reimage [14:54:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage [14:54:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2252.codfw.wmnet with reason: host reimage [14:55:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2251.codfw.wmnet with reason: host reimage [14:55:23] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [14:55:26] !log sukhe@dns1005 START - running authdns-update [14:55:35] !log resuming firmware updates on cp4047 via T387238 [14:55:38] !log testing dummy authdns-update run [14:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:39] T387238: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238 [14:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:50] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [14:56:21] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp4047.ulsfo.wmnet [14:56:46] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [14:56:56] !log sukhe@dns1005 END - running authdns-update [14:57:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:57:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:57:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:57:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1114.eqiad.wmnet with OS bullseye [14:58:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662274 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1114.eqiad.wmnet with OS bu... [14:58:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2250.codfw.wmnet with reason: host reimage [14:58:10] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:58:26] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1116.eqiad.wmnet with reason: host reimage [14:58:29] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:58:41] FIRING: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [14:58:50] !incidents [14:58:51] 5766 (UNACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [14:58:51] 5765 (RESOLVED) DDoSDetected sre (netflow3003:9100 esams) [14:59:01] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [14:59:08] !ack 5766 [14:59:08] 5766 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [14:59:27] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bookworm [14:59:29] * akosiaris looking [14:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662280 (10phaultfinder) [14:59:47] here as well if you need eyes / hands o/ [14:59:51] !log 
elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2004 [14:59:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:59:59] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [15:00:13] CRITICAL: Generic error: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure [15:00:17] hmmm, doesn't look good [15:00:26] I'll bounce it once [15:01:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2252.codfw.wmnet with reason: host reimage [15:01:12] well, that apparently worked [15:02:03] reading through the journal - yeah, seems to have picked up where it left off cleanly [15:02:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:02:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1115.eqiad.wmnet with OS bullseye [15:02:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1115.eqiad.wmnet with OS bu... 
[15:02:28] FIRING: [2x] SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:14] while not "high" in an absolute sense, wow that's a lot of spicerack lock write traffic [15:03:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:03:41] RESOLVED: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [15:03:51] which makes sense of course given the number of reimages etc. going on, but still [15:04:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage [15:05:51] (03PS1) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:05:52] (03CR) 10Alexandros Kosiaris: [C:03+2] "PCC errors looks like a fluke. 
Just on puppet 7, and for a fail fast for deploy1003" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [15:06:29] swfrench-wmf: I was noticing the same thing [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:16] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert "admin: remove spiderpig from deployment group" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 (owner: 10Ahmon Dancy) [15:07:17] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2004 - elukey@cumin1002" [15:07:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:07:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2004 - elukey@cumin1002" [15:07:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:23] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2004.codfw.wmnet 11.48.192.10.in-addr.arpa 1.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:24] (03PS2) 10Ahmon Dancy: Revert "admin: remove spiderpig from deployment group" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 [15:07:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2004.codfw.wmnet 11.48.192.10.in-addr.arpa 1.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:27] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2004 
[15:08:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1116.eqiad.wmnet with reason: host reimage [15:08:43] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert "admin: remove spiderpig from deployment group" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 (owner: 10Ahmon Dancy) [15:08:56] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:08:58] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:09:09] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T389538#10662334 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I'm afraid the logs aren't very helpful here - I can see the PUT... [15:09:22] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662339 (10phaultfinder) [15:09:57] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:10:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:10:46] !incidents [15:10:46] 5766 (RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [15:10:47] 5765 (RESOLVED) 
DDoSDetected sre (netflow3003:9100 esams) [15:10:54] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp4047.ulsfo.wmnet [15:10:58] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet [15:11:01] ah, just IRC page. [15:11:02] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4047.ulsfo.wmnet [15:11:39] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1117.eqiad.wmnet with OS bullseye [15:11:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1118.eqiad.wmnet with OS bullseye [15:11:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1120.eqiad.wmnet with OS bullseye [15:11:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1117.eqiad.wmnet with O... [15:11:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1118.eqiad.wmnet with O... [15:11:52] indeed quite a bit of error compared to the usual 0rps [15:11:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1120.eqiad.wmnet with O... 
[15:11:58] it's at 12rps [15:12:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2251.codfw.wmnet with reason: host reimage [15:12:14] * akosiaris looking [15:12:24] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [15:12:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2004 [15:12:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2004 [15:13:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:14:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:14:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2250.codfw.wmnet with OS bookworm [15:14:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:14:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662358 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2250.codfw.wmnet with OS... 
[15:16:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2249.codfw.wmnet with reason: host reimage [15:16:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:16:43] (03PS2) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:16:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:16:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2252.codfw.wmnet with OS bookworm [15:17:01] (03PS3) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:17:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2252.codfw.wmnet with OS... 
[15:17:20] (03PS3) 10Giuseppe Lavagetto: aptrepo: remove old keys [puppet] - 10https://gerrit.wikimedia.org/r/1097332 [15:17:39] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:17:57] (03PS1) 10Fabfur: First proposal to commit vendored dependencies [debs/benthos] - 10https://gerrit.wikimedia.org/r/1130141 (https://phabricator.wikimedia.org/T388261) [15:18:09] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [15:18:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2327 to codfw - jhancock@cumin2002" [15:18:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2327 to codfw - jhancock@cumin2002" [15:18:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:19:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:19:27] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:19:39] RESOLVED: [2x] SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2249.codfw.wmnet with reason: host reimage [15:19:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:19:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2004.codfw.wmnet with OS bookworm [15:19:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1122.eqiad.wmnet with OS bullseye [15:20:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-ctrl2004.codfw.wmnet with OS bo... [15:20:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1122.eqiad.wmnet with O... 
[15:20:22] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5133/console" [puppet] - 10https://gerrit.wikimedia.org/r/1097332 (owner: 10Giuseppe Lavagetto) [15:20:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662400 (10phaultfinder) [15:20:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:21:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:22:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:22:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:22:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1116.eqiad.wmnet with OS bullseye [15:22:07] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1118.eqiad.wmnet with reason: host reimage [15:22:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1116.eqiad.wmnet with OS bu... 
[15:22:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1117.eqiad.wmnet with reason: host reimage [15:22:27] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1120.eqiad.wmnet with reason: host reimage [15:23:22] (03PS4) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:24:17] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:24:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:25:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1118.eqiad.wmnet with reason: host reimage [15:25:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:26:50] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:27:03] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:27:05] klausman: There is a page for the api-gateway getting too many 504 from the rate limit cluster. The graphs, https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway, coincide with https://sal.toolforge.org/log/Ipo2uZUBffdvpiTrdmWt [15:27:11] should we revert? 
[15:27:23] (03PS5) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:27:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:27:43] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:27:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:27:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1117.eqiad.wmnet with reason: host reimage [15:27:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:27:55] alert is at https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh&q=envoy_cluster_name%3Drate_limit_cluster for what it's worth [15:28:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:28:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2251.codfw.wmnet with OS bookworm [15:28:11] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662447 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2251.codfw.wmnet with OS... 
[15:28:14] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10662449 (10RobH) The iDrac and BIOS firmware versions have been (incrementally) updated to the newest versions of each. The error has cleared out during this process, which is what support was counting... [15:28:39] (03PS6) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:28:59] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:29:06] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:29:32] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for cloudsw1-c8-eqiad cloud-private vrf loopback - cmooney@cumin1002" [15:29:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns for cloudsw1-c8-eqiad cloud-private vrf loopback - cmooney@cumin1002" [15:29:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:43] (03PS1) 10Cathal Mooney: Add include statement for WMCS Eqiad reverse IPv6 snippet [dns] - 10https://gerrit.wikimedia.org/r/1130143 (https://phabricator.wikimedia.org/T379283) [15:30:01] (03PS7) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:30:17] akosiaris: I don't think so [15:30:33] Let me do some digging [15:30:35] (03PS8) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 
10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:30:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:30:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1120.eqiad.wmnet with reason: host reimage [15:31:11] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:32:00] I am unsure what that alert signifies in the context of the API GW. The change I made should basically just throw 429s for that staging service if the quota is exhausted [15:32:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:32:18] hnowlan: ping? [15:32:27] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp4047.ulsfo.wmnet [15:32:47] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp4047.ulsfo.wmnet [15:33:01] klausman: he isn't around today. 
And it's Friday afternoon in the EU [15:33:13] I'd suggest a revert and revisiting on Monday [15:33:54] alright, yeah [15:33:55] (03CR) 10Ssingh: [C:03+1] Add include statement for WMCS Eqiad reverse IPv6 snippet [dns] - 10https://gerrit.wikimedia.org/r/1130143 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [15:34:42] (03PS1) 10Klausman: Revert "api-gateway: allow anonymous requests to edit-check on ml-staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130144 [15:34:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:34:54] (03PS9) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:35:05] (03CR) 10Cathal Mooney: [C:03+2] Add include statement for WMCS Eqiad reverse IPv6 snippet [dns] - 10https://gerrit.wikimedia.org/r/1130143 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [15:35:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:35:40] !log cmooney@dns2005 START - running authdns-update [15:36:03] (03CR) 10Ilias Sarantopoulos: [C:03+1] Revert "api-gateway: allow anonymous requests to edit-check on ml-staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130144 (owner: 10Klausman) [15:36:17] (03CR) 10Alexandros Kosiaris: [C:03+1] Revert "api-gateway: allow anonymous requests to edit-check on ml-staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130144 (owner: 10Klausman) [15:36:32] (03CR) 10Klausman: [C:03+2] Revert "api-gateway: allow anonymous requests to edit-check on ml-staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130144 (owner: 10Klausman) [15:36:34] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host elastic1121.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:37:04] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:37:32] !log cmooney@dns2005 END - running authdns-update [15:38:22] (03PS2) 10Jforrester: Enable SUL3 login for 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [15:38:22] (03Merged) 10jenkins-bot: Revert "api-gateway: allow anonymous requests to edit-check on ml-staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130144 (owner: 10Klausman) [15:38:26] (03CR) 10Jforrester: Enable SUL3 login for 1% of group 2 users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [15:39:18] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:39:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10662500 (10bking) Hello DC Ops, I just found T356919 , which is the same host having th... 
[15:39:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:39:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2249.codfw.wmnet with OS bookworm [15:39:42] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:39:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2249.codfw.wmnet with OS... [15:39:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:39:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:40:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:40:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1118.eqiad.wmnet with OS bullseye [15:40:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1118.eqiad.wmnet with OS bu... 
[15:40:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1121.eqiad.wmnet with OS bullseye [15:40:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1121.eqiad.wmnet with O... [15:41:02] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:41:06] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632 (10RobH) 03NEW [15:41:09] merged and deployed, will do codfw and staging in a sec. graph is already dropping [15:41:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:41:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:41:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1117.eqiad.wmnet with OS bullseye [15:41:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662531 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1117.eqiad.wmnet with OS bu... 
[15:41:37] ...and going back up [15:41:45] FIRING: [2x] MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:41:48] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10662544 (10RobH) a:03MatthewVernon @MatthewVernon, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team rec... [15:42:02] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:42:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662550 (10Jclark-ctr) [15:42:17] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:42:18] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:42:25] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:42:35] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:42:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10662551 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving... 
[15:43:17] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10662553 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving... [15:43:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662555 (10Jclark-ctr) [15:43:41] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10662557 (10RobH) [15:43:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:44:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662558 (10phaultfinder) [15:45:50] akosiaris: the revert does not seem to have reduced the 504 rate [15:46:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662561 (10bking) [15:46:45] FIRING: [2x] MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:47:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:47:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1120.eqiad.wmnet with OS bullseye [15:47:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 
2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1120.eqiad.wmnet with OS bu... [15:48:42] (03PS10) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:50:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1121.eqiad.wmnet with reason: host reimage [15:51:45] RESOLVED: [2x] MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:52:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662581 (10Jclark-ctr) [15:52:43] 10SRE-swift-storage, 10MediaWiki-File-management, 10MediaWiki-Uploading: Swift file replicated to codfw but not eqiad - https://phabricator.wikimedia.org/T389539#10662583 (10MatthewVernon) Both of these were uploaded during [[ https://www.wikimediastatus.net/incidents/xc1n0wbml3n4 | an incident affecting ima... 
[15:52:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10662585 (10Jclark-ctr) [15:53:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1121.eqiad.wmnet with reason: host reimage [15:53:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:53:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662586 (10Jhancock.wm) [15:53:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10662588 (10bking) [15:54:30] (03PS11) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662595 (10phaultfinder) [15:56:30] (03PS1) 10Bking: elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) [15:56:44] (03PS12) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [15:56:54] (03CR) 10CI reject: [V:04-1] elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) (owner: 
10Bking) [15:57:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:58:45] FIRING: [2x] MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:58:59] (03PS2) 10Bking: elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) [15:59:22] (03CR) 10CI reject: [V:04-1] elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [15:59:24] klausman: indeed [15:59:28] damn [15:59:45] it might be that my change wasn't bad, but that this was triggered by the restart [15:59:58] I had a look at external traffic btw, no changes that would explain it easily [16:00:08] could be [16:00:37] having a cluster called rate_limit doesn't make it less confusing [16:01:19] it's envoy terminology [16:01:30] (03CR) 10Gergő Tisza: Enable SUL3 login for 1% of group 2 users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130122 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [16:01:44] not disagreeing, but not sure what we would name it [16:02:40] (03CR) 10CI reject: [V:04-1] WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [16:02:52] /a [16:02:58] /b [16:02:59] etc [16:03:09] lol, been there done that [16:03:18] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634 (10RobH) 03NEW [16:03:39] akosiaris: I _think_ 
since the hour has now ticked over, the rate may be going down (apigw ratelimits are per-hour) [16:03:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1122.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:03:45] RESOLVED: [2x] MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [16:03:48] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10662656 (10RobH) [16:03:58] So my change might have been bad, it just took the tick-over to materialize? [16:04:19] (03PS3) 10Bking: elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) [16:04:21] uhm [16:04:29] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2004.codfw.wmnet with OS bookworm [16:04:34] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10662658 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the n... [16:04:46] that would be confusing to say the least [16:04:52] (03PS4) 10Bking: elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) [16:04:53] Yeah, agreed. 
[16:05:00] but it is indeed rapidly dropping [16:05:04] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bookworm [16:05:11] which for a Friday afternoon, is good enough for me [16:05:20] but what on earth? [16:05:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [16:05:52] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635 (10RobH) 03NEW [16:06:01] well, alert cleared [16:06:04] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10662679 (10RobH) [16:06:09] akosiaris: I dunno. But I vaguely remember apigw docs that say that apigw usage is reset on the hour [16:06:26] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10662683 (10RobH) a:03MatthewVernon Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the n... [16:06:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:06:47] klausman: could be.
Let's retry this on Monday and see what happens [16:07:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:07:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1121.eqiad.wmnet with OS bullseye [16:07:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10662702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 fo... [16:07:51] akosiaris: ack. and thanks for the ping & help [16:08:05] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:08:42] (03PS5) 10Bking: elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) [16:08:44] klausman: thanks as well. Have a nice weekend! [16:09:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10662709 (10Jclark-ctr) [16:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662712 (10phaultfinder) [16:10:02] 10SRE-swift-storage, 10MediaWiki-File-management, 10MediaWiki-Uploading: Swift file replicated to codfw but not eqiad - https://phabricator.wikimedia.org/T389539#10662717 (10MatthewVernon) OK, turning to Emma_Müller_Edle_von_Seehof_Bub_mit_Federhut.jpg first, it's present in codfw and not in eqiad, and that... 
[16:10:13] (03CR) 10Alexandros Kosiaris: [C:03+1] data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [16:10:23] (03CR) 10Bking: [C:03+2] elasticsearch/relforge: rename 3 elastic hosts to relforge [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:10:44] (03CR) 10Bking: [C:03+2] "self-merging to speed up provisioning" [puppet] - 10https://gerrit.wikimedia.org/r/1130145 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:11:59] 10SRE-swift-storage, 10MediaWiki-File-management, 10MediaWiki-Uploading: Upload stack fails to upload to both swift clusters or inform uploader of said failure - https://phabricator.wikimedia.org/T389539#10662738 (10MatthewVernon) [16:12:16] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2253-8 to codfw - jhancock@cumin2002" [16:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2253-8 to codfw - jhancock@cumin2002" [16:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:14] (03PS13) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [16:14:18] (03PS1) 10Superpes15: [kowikiquote] Change the logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130147 (https://phabricator.wikimedia.org/T389631) [16:14:43] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:16:30] (03CR) 10Pppery: "Well, feel free to try, I guess." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [16:17:45] (03CR) 10Pppery: "Leaving to the SRE team on whether this is a good idea without any recommendations - one could make a https://www.w3.org/Provider/Style/UR" [puppet] - 10https://gerrit.wikimedia.org/r/1130096 (https://phabricator.wikimedia.org/T307965) (owner: 10Aklapper) [16:18:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:18:21] (03CR) 10BCornwall: [C:03+2] upgrade cp3079 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1129865 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:19:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3079.esams.wmnet} and A:cp [16:21:16] (03CR) 10CI reject: [V:04-1] WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [16:22:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2253-8 to codfw - jhancock@cumin2002" [16:22:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2253-8 to codfw - jhancock@cumin2002" [16:22:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2253 [16:23:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2254 [16:23:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2255 [16:23:23] !log jhancock@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2256 [16:23:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2257 [16:23:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2258 [16:23:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2253 [16:23:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2255 [16:23:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2254 [16:23:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2257 [16:23:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2256 [16:23:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2258 [16:23:42] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2257 [16:23:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2257 [16:23:54] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2255 [16:24:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2255 [16:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662789 (10phaultfinder) [16:25:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:13] !log jhancock@cumin2002 
START - Cookbook sre.hosts.provision for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:13] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3079.esams.wmnet} and A:cp [16:25:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:19] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:54] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:54] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:26:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:26:37] (03PS14) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [16:26:52] (03PS1) 10Bking: relforge: set partitioning scheme for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1130150 (https://phabricator.wikimedia.org/T384966) [16:26:55] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:28:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10662830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 fo... [16:29:06] (03CR) 10Bking: [C:03+2] relforge: set partitioning scheme for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1130150 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:29:27] (03CR) 10Bking: [C:03+2] "self-merging to prevent failed reimages." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130150 (https://phabricator.wikimedia.org/T384966) (owner: 10Bking) [16:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10662837 (10phaultfinder) [16:32:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:32:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1122.eqiad.wmnet with OS bullseye [16:32:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10662865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1122... [16:33:15] (03CR) 10CI reject: [V:04-1] WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [16:33:18] (03CR) 10Aklapper: "Right, "Cool URIs don't change" was on my mind but I prefer to break URIs when they link to stuff that simply shall not get any exposure." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130096 (https://phabricator.wikimedia.org/T307965) (owner: 10Aklapper) [16:35:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:35:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:36:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:36:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:36:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:36:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10662879 (10MatthewVernon) a:05MatthewVernon→03None No changes needed for these nodes (install and site.pp is ready for ms-fe*) [16:36:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:36:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:37:54] (03PS15) 10Elukey: WIP - sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) [16:38:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10662882 (10MatthewVernon) a:05MatthewVernon→03None Puppet is already set up (site.pp and installserver) for ms-fe*, so no further action needed from me at this point, you'r... 
[16:38:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:43:17] (03PS1) 10MVernon: site/install: prep for new apus and thanosnodes [puppet] - 10https://gerrit.wikimedia.org/r/1130151 (https://phabricator.wikimedia.org/T389632) [16:43:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2254.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:43:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1122.eqiad.wmnet with reason: host reimage [16:43:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:43:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2258.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:43:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2256.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:44:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2257.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:46:18] (03PS2) 10MVernon: site/install: prep for new apus and thanos nodes [puppet] - 10https://gerrit.wikimedia.org/r/1130151 (https://phabricator.wikimedia.org/T389632) [16:46:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1122.eqiad.wmnet with reason: host reimage [16:47:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2253.codfw.wmnet with OS bookworm [16:47:13] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2254.codfw.wmnet with OS bookworm [16:47:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2255.codfw.wmnet with OS bookworm [16:47:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2256.codfw.wmnet with OS bookworm [16:47:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2258.codfw.wmnet with OS bookworm [16:47:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with... [16:47:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2257.codfw.wmnet with OS bookworm [16:47:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with... [16:47:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2256.codfw.wmnet with... [16:47:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2258.codfw.wmnet with... 
[16:47:29] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10662950 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2257.codfw.wmnet with... [16:48:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10662952 (10MatthewVernon) a:05MatthewVernon→03None Puppet work done, unassigning myself [16:48:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10662954 (10MatthewVernon) a:05MatthewVernon→03None Puppet work done, unassigning myself. [16:48:45] (03Abandoned) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) (owner: 10Daimona Eaytoy) [16:48:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10662957 (10MatthewVernon) a:05MatthewVernon→03None Puppet work done, unassigning myself. 
[16:49:32] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:50:42] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1119.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:58:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2253.codfw.wmnet with reason: host reimage [16:59:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:00:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:00:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1122.eqiad.wmnet with OS bullseye [17:00:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10663004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1122.eqi... 
[17:00:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10663011 (10Jclark-ctr) [17:01:14] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:01:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2253.codfw.wmnet with reason: host reimage [17:02:07] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2004.codfw.wmnet with OS bookworm [17:06:07] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: set cloudsw cloud vrf xlink dns to wikimediacloud.org domain - cmooney@cumin1002" [17:06:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: set cloudsw cloud vrf xlink dns to wikimediacloud.org domain - cmooney@cumin1002" [17:06:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:08:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10663063 (10Jclark-ctr) [17:09:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:17:45] (03PS1) 10Cathal Mooney: Add reverse zone for 172.31.0.0/16 [dns] - 10https://gerrit.wikimedia.org/r/1130159 (https://phabricator.wikimedia.org/T379283) [17:19:30] (03CR) 10CI reject: [V:04-1] Add reverse zone for 172.31.0.0/16 [dns] - 
10https://gerrit.wikimedia.org/r/1130159 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [17:19:34] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663162 (10phaultfinder) [17:22:43] (03PS2) 10Cathal Mooney: Add reverse zone for 172.31.0.0/16 [dns] - 10https://gerrit.wikimedia.org/r/1130159 (https://phabricator.wikimedia.org/T379283) [17:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663189 (10phaultfinder) [17:25:45] (03PS1) 10Bking: elastic: add test hieradata to help with LVS migration [puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) [17:26:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [17:27:45] (03CR) 10Ssingh: [C:03+1] Add reverse zone for 172.31.0.0/16 [dns] - 10https://gerrit.wikimedia.org/r/1130159 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [17:30:55] (03CR) 10Cathal Mooney: [C:03+2] Add reverse zone for 172.31.0.0/16 [dns] - 10https://gerrit.wikimedia.org/r/1130159 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [17:31:17] !log cmooney@dns2005 START - running authdns-update [17:32:39] (03CR) 10Bking: "@vgutierrez@wikimedia.org Let me know if this approach seems OK to you. If so, we'll start adding more host hiera files." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [17:33:09] !log cmooney@dns2005 END - running authdns-update [17:43:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:43:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2253.codfw.wmnet with OS bookworm [17:44:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2253.codfw.wmnet with OS... [17:46:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663386 (10Jhancock.wm) [17:47:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663406 (10Jhancock.wm) wikikube-worker2254-58 did not ping on the network address after dhcp. need to investigate. 
[17:47:30] !log mforns@deploy1003 Started deploy [airflow-dags/analytics@317134a]: finalize airflow migration [17:48:02] !log mforns@deploy1003 Finished deploy [airflow-dags/analytics@317134a]: finalize airflow migration (duration: 00m 44s) [17:49:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663409 (10phaultfinder) [17:51:26] !log dancy@deploy1003 Installing scap version "4.143.1" for 2 host(s) [17:53:12] !log dancy@deploy1003 Installation of scap version "4.143.1" completed for 2 hosts [17:56:42] (03PS1) 10Cwhite: bugfix: add back missing pipe char to conform to dogstatsd spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130166 (https://phabricator.wikimedia.org/T359385) [17:58:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130128 (https://phabricator.wikimedia.org/T383801) (owner: 10Santiago Faci) [17:59:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130166 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [18:02:28] (03CR) 10Dzahn: [C:03+2] profile::tlsproxy::envoy: Tweak an error message [puppet] - 10https://gerrit.wikimedia.org/r/1129940 (owner: 10Ahmon Dancy) [18:07:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2256.codfw.wmnet with OS bookworm [18:07:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2257.codfw.wmnet with OS bookworm [18:07:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - 
https://phabricator.wikimedia.org/T384970#10663496 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2256.codfw.wmnet with OS... [18:07:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2254.codfw.wmnet with OS bookworm [18:07:46] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663497 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2257.codfw.wmnet with OS... [18:07:50] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663498 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2254.codfw.wmnet with OS... [18:07:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2258.codfw.wmnet with OS bookworm [18:07:54] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664 (Eevans) NEW [18:07:56] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663509 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2258.codfw.wmnet with OS...
[18:08:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2255.codfw.wmnet with OS bookworm [18:08:07] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10663510 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with OS... [18:19:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663548 (phaultfinder) [18:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663572 (phaultfinder) [18:25:58] !log enabling ospf cloudsw1-c8-eqiad [18:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:58] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1043.eqiad.wmnet [18:27:58] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1043.eqiad.wmnet [18:32:59] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [18:40:19] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Addtional IPs for restbase1044 - eevans@cumin1002" [18:40:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Addtional IPs for restbase1044 - eevans@cumin1002" [18:40:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:41:10] (PS1) Sbisson: Enable Section Translation and Unified Dashboard on all wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1130169 (https://phabricator.wikimedia.org/T387821) [18:42:03] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [18:43:15] (PS1) Kimberly Sarabia: Deploy donate banner
everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) [18:44:16] (PS2) Kimberly Sarabia: Deploy donate banner everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) [18:45:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663640 (phaultfinder) [18:46:16] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Addtional IPs for restbase1045 - eevans@cumin1002" [18:46:21] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Addtional IPs for restbase1045 - eevans@cumin1002" [18:46:21] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:51:12] (PS1) Eevans: restbase: bootstrap restbase1044 (refresh for restbase1029) [puppet] - https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) [18:51:13] (PS1) Eevans: restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) [18:51:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130169 (https://phabricator.wikimedia.org/T387821) (owner: Sbisson) [18:54:10] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) (owner: Eevans) [18:54:20] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: Eevans) [18:57:37] (CR) Eevans: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: Eevans) [19:07:54] !log dancy@deploy1003 Installing scap version "4.143.2" for 2 host(s) [19:09:40] !log dancy@deploy1003 Installation of scap version "4.143.2" completed for 2 hosts [19:11:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [19:16:37] looking at this ^ [19:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [19:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:36:43] (CR) Eevans: [C:+1] site/install: prep for new apus and thanos nodes [puppet] - https://gerrit.wikimedia.org/r/1130151 (https://phabricator.wikimedia.org/T389632) (owner: MVernon) [19:39:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663829 (phaultfinder) [19:44:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663835 (phaultfinder) [19:49:45] FIRING: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO -
https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:51:53] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10663848 (bd808) To help folks reason about these choices a bit: * Anyone in #acl_sre-team is also in #acl_security because the former is a subproject of the latter. ** This means that anyone... [19:54:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663852 (phaultfinder) [19:54:45] RESOLVED: MjolnirUpdateFailureRateExceedesThreshold: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:59:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663867 (phaultfinder) [20:14:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10663926 (phaultfinder) [20:54:29] (CR) BCornwall: [C:+2] upgrade cp3080 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129866 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:54:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664118 (phaultfinder) [20:55:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3080.esams.wmnet} and A:cp [20:59:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664123 (phaultfinder) [21:00:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3080.esams.wmnet} and A:cp [21:04:36] ops-eqiad, SRE, DC-Ops: PDU
sensor over limit - https://phabricator.wikimedia.org/T388236#10664138 (phaultfinder) [21:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664141 (phaultfinder) [21:15:29] (PS3) Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [21:17:57] (CR) Dwisehaupt: "Thanks for the review @jhathaway@wikimedia.org This last patchset includes the final username along with a hiera variable to override it i" [puppet] - https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [21:26:40] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - ryankemper@cumin2002 - T389119 [21:26:45] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [21:27:55] !log sukhe@deploy1003:~$ echo 'https://spiderpig.wikimedia.org/api/whoami' | mwscript-k8s --attach -- purgeList.php [21:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:02] dancy: ^ [21:28:06] (CR) Pppery: "https://kab.wikipedia.org/wiki/Uslig:ApiSandbox?uselang=en#action=query&format=json&list=users&formatversion=2&usprop=rights&ususers=Flow%" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe) [21:29:26] Thanks!
[21:33:25] (CR) BCornwall: [C:+2] upgrade cp3081 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1129867 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [21:36:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp3081.esams.wmnet} and A:cp [21:41:14] (PS1) JHathaway: apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) [21:41:54] (PS1) Creynolds: component: puppet dumps web enterprise page update [puppet] - https://gerrit.wikimedia.org/r/1130196 [21:43:06] (CR) CI reject: [V:-1] apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: JHathaway) [21:43:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp3081.esams.wmnet} and A:cp [21:45:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: try rolling operation without allow-yellow flag - ryankemper@cumin2002 - T389119 [21:45:10] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [21:48:57] (PS2) JHathaway: apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) [21:50:53] (CR) CI reject: [V:-1] apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: JHathaway) [21:52:25] FIRING: SystemdUnitFailed: opensearch_1@cloudelastic-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:57:25] FIRING: [4x] SystemdUnitFailed: opensearch_1@cloudelastic-eqiad.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:02:42] (PS1) Umherirrender: Improve function and property documentation for php code [mediawiki-config] - https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [22:15:42] (CR) Zoe: "thanks for looking in to it - I guess I'll need to dig further" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe) [22:17:48] (CR) Ahoelzl: [C:+1] component: puppet dumps web enterprise page update [puppet] - https://gerrit.wikimedia.org/r/1130196 (owner: Creynolds) [22:18:55] (CR) Pppery: "https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1130197 would help here a lot." [mediawiki-config] - https://gerrit.wikimedia.org/r/1129795 (https://phabricator.wikimedia.org/T380911) (owner: Zoe) [22:24:18] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10664347 (bd808) My personal choice would be #acl_security. Tasks with that protection level have an existing review workflow that can be used to convert from a Security Issue to a public task...
[22:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664348 (phaultfinder) [22:26:12] (PS3) JHathaway: apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) [22:27:25] RESOLVED: [4x] SystemdUnitFailed: opensearch_1@cloudelastic-eqiad.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:20] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10664372 (Eevans) >>! In T389664#10663848, @bd808 wrote: > To help folks reason about these choices a bit: > * Anyone in #acl_sre-team is also in #acl_security because the former is a subproje... [22:30:11] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: JHathaway) [22:30:13] (CR) Cwhite: [C:+1] logstash: move filter_truncate before indexing/output [puppet] - https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) (owner: Filippo Giunchedi) [22:30:16] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10664380 (Eevans) p:Triage→Medium [22:30:28] (CR) Cwhite: [C:+1] hieradata: move prometheus k8s instances off prometheus2006 [puppet] - https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi) [22:30:33] (CR) Krinkle: [C:+2] docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle) [22:30:55] (CR) TrainBranchBot: [C:+2]
"Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle) [22:31:23] (Merged) jenkins-bot: docroot: Enable Chrome credential sharing on foundation.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1129922 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle) [22:31:57] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1129922|docroot: Enable Chrome credential sharing on foundation.wikimedia.org (T385520)]] [22:32:01] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [22:35:05] (PS4) JHathaway: apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) [22:35:14] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: JHathaway) [22:36:52] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1129922|docroot: Enable Chrome credential sharing on foundation.wikimedia.org (T385520)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:40:44] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10664409 (AntiCompositeNumber) #acl_sre-team != #acl_security_sre. #acl_sre-team does not appear on https://phabricator.wikimedia.org/project/subprojects/30/ as a subproject of #acl_security (... [22:45:16] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10664414 (bd808) >>! In T389664#10664409, @AntiCompositeNumber wrote: > #acl_sre-team != #acl_security_sre. #acl_sre-team does not appear on https://phabricator.wikimedia.org/project/subprojec...
[22:51:11] !log krinkle@deploy1003 krinkle: Continuing with sync [22:58:11] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129922|docroot: Enable Chrome credential sharing on foundation.wikimedia.org (T385520)]] (duration: 26m 14s) [22:58:16] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [22:59:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664442 (phaultfinder) [23:11:01] SRE, Commons, MediaWiki-File-management, serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#10664451 (zdev) >>! In T266155#9766334, @Bawolff wrote: > I think if we did deliver the wrong thumbsiz... [23:18:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [23:18:38] SRE, LDAP-Access-Requests: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699 (bd808) NEW [23:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:22:14] SRE, LDAP-Access-Requests: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699#10664470 (bd808) My manager is @Lferreira and the approver for these groups is @thcipriani. I am working as a Principal Engineer in the Developer Experience group with a current prima...
[23:26:05] SRE, LDAP-Access-Requests: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699#10664472 (bd808) Am I correct in understanding that being added to the `releng` LDAP group will also mean that I will be added to the `release-engineering` group in `ops/puppet.git:mo... [23:28:14] SRE, LDAP-Access-Requests: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699#10664473 (bd808) [23:34:47] SRE, LDAP-Access-Requests: Grant Access to `releng` & `gerritadmin` for bd808 - https://phabricator.wikimedia.org/T389699#10664480 (thcipriani) I'm the approver for the corresponding groups in `admin/data/data.yaml` (i.e., `release-engineering`). I've never been asked to approve for the ldap group (but u... [23:40:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:44:23] (PS3) Jdlrobson: Deploy donate banner for all wikis except English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) (owner: Kimberly Sarabia) [23:44:46] (CR) Jdlrobson: [C:+1] Deploy donate banner for all wikis except English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1130170 (https://phabricator.wikimedia.org/T388438) (owner: Kimberly Sarabia) [23:45:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.386s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:58:05] (Abandoned) Reedy: nooop [mediawiki-config] - https://gerrit.wikimedia.org/r/1129365 (owner: Reedy) [23:58:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: Reedy) [23:59:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10664521 (phaultfinder)