[00:27:06] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:31:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[00:32:06] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986828
[00:38:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986828 (owner: TrainBranchBot)
[00:42:03] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:27] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/986828 (owner: TrainBranchBot)
[01:03:51] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T354155 (phaultfinder)
[01:40:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:44] SRE, Commons, Traffic-Icebox, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (Pols12) Wikidata does not use `uselang` hack: this task is specific to Commons. See T58464 for a global task (al...
[02:37:06] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[03:07:22] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088)
[03:07:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088) (owner: TrainBranchBot)
[03:08:54] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:37] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088) (owner: TrainBranchBot)
[04:01:58] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088)
[04:02:00] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088) (owner: TrainBranchBot)
[04:03:26] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088) (owner: TrainBranchBot)
[04:03:51] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.12 refs T350088
[04:04:01] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088
[04:19:30] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[04:31:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:47:55] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:40] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.12 refs T350088 (duration: 56m 48s)
[05:00:47] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088
[05:11:03] SRE, Wikimedia-Mailing-lists, Performance Issue: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891 (Ladsgroup) I see the problem in two areas only: - Opening the main page - Opening the page for a mailing list with a lot of members. For the first one, it...
[05:13:57] PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:14:25] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/200: 2.21186413264781s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:28:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver POST/200: 2.21186413264781s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[05:40:33] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:57:39] (PS1) Marostegui: installserver: Do not format dbstore1009 [puppet] - https://gerrit.wikimedia.org/r/986871
[06:01:37] (CR) Marostegui: [C: +2] installserver: Do not format dbstore1009 [puppet] - https://gerrit.wikimedia.org/r/986871 (owner: Marostegui)
[06:06:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2144.codfw.wmnet with OS bookworm
[06:12:18] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) @ABran-WMF @MoritzMuehlenhoff we are going to have to give this more priority, dbstore1003 (s1) is now failing in orchestrator as be...
[06:13:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 15m 4s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:18:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 9m 41s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:24:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2144.codfw.wmnet with reason: host reimage
[06:27:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2144.codfw.wmnet with reason: host reimage
[06:45:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2144.codfw.wmnet with OS bookworm
[08:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T0800). nyaa~
[08:00:04] No Gerrit patches in the queue for this window AFAICS.
[08:12:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:13:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:16:37] (PS1) Peter Fischer: configure message_key_fields for update_pipeline [mediawiki-config] - https://gerrit.wikimedia.org/r/987028
[08:17:17] (CR) Peter Fischer: [C: +2] configure message_key_fields for update_pipeline [mediawiki-config] - https://gerrit.wikimedia.org/r/987028 (owner: Peter Fischer)
[08:18:00] (Merged) jenkins-bot: configure message_key_fields for update_pipeline [mediawiki-config] - https://gerrit.wikimedia.org/r/987028 (owner: Peter Fischer)
[08:18:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51307 bytes in 6.414 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:45] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:20:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:31] (CR) Sohom Datta: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: Houseblaster)
[08:26:04] !log akosiaris@cumin1001 START - Cookbook sre.hosts.provision for host mw2448.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:26:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:27:23] !log restart prometheus@k8s prometheus@k8s-aux in eqiad - T343529
[08:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:27] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[08:27:49] Amir1: Hi! I accidentally +2ed a minor config change that was intended to be deployed as part of a back port window. (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/987028) Does that get deployed now or do I still have to request a deployment explicitly?
[08:28:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:30:24] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:31:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:31:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:33:41] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2448.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:34:27] (PS1) Giuseppe Lavagetto: Remove throttle exception [mediawiki-config] - https://gerrit.wikimedia.org/r/987031 (https://phabricator.wikimedia.org/T352569)
[08:34:29] (PS1) Giuseppe Lavagetto: Use shellbox for djvu handling on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515)
[08:34:30] (KubernetesAPINotScrapable) resolved: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:34:31] (PS1) Giuseppe Lavagetto: Always process media files via shellbox on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515)
[08:35:21] <_joe_> jouncebot: nowandnext
[08:35:21] For the next 0 hour(s) and 24 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T0800)
[08:35:21] In 2 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[08:35:39] <_joe_> uhm I guess no one's around
[08:39:31] (PS2) Brouberol: spark-history: add availability monitor [alerts] - https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717)
[08:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:41:43] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (akosiaris) The machine isn't pooled yet into traffic. There is an alert for frequent changes due to puppet run. Indeed the following happens at every puppet run `Notice: /Stage[main]/Cpufrequtils/Exec...
[08:43:39] (CR) Brouberol: "It _seems_ that dual stack for Services without selectors is supported and does what we want by default: https://kubernetes.io/docs/concep" [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:00:40] SRE, SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (MoritzMuehlenhoff) I've corrected the group membership; contractors should be in cn=wmf, the cn=nda LDAP group is for community members w...
[09:02:22] !log pfischer@deploy2002 Started scap: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]]
[09:02:54] !log installing nodejs security updates on bookworm
[09:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:09] !log pfischer@deploy2002 pfischer: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:05:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:05:53] !log pfischer@deploy2002 pfischer: Continuing with sync
[09:06:08] pfischer: if you accidentally +2 a mw-config patch, you need to either immediately deploy it or immediately revert and pull the reverting commit to deploy2002
[09:06:37] taavi: Thanks! Deployment is already in progress.
[09:09:54] (CR) Urbanecm: [C: +1] [namespaces] Use correct diacritics in Romanian (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[09:10:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:10:39] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:17:57] !log pfischer@deploy2002 Finished scap: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]] (duration: 15m 35s)
[09:20:24] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:21:00] (PS1) Btullis: Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:21:40] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[09:22:00] (CR) Marostegui: [C: +1] Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[09:23:27] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Commissioning new database server
[09:23:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Commissioning new database server
[09:23:55] (PS1) Marostegui: dbstore1008: Add sections [puppet] - https://gerrit.wikimedia.org/r/987117 (https://phabricator.wikimedia.org/T351921)
[09:24:58] SRE, ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T353913 (phaultfinder)
[09:26:07] (CR) Muehlenhoff: [C: +2] check_wmf_styleguide: Remove check to enforce presence of system::role [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/983689 (owner: Muehlenhoff)
[09:31:51] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (fnegri) @Jclark-ctr the host was restarted on Dec 22 at 18:29 UTC. Has the CPU been replaced?
[09:35:04] (PS2) Btullis: Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:37:18] (CR) Btullis: [C: +1] spark-history: add availability monitor [alerts] - https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717) (owner: Brouberol)
[09:41:13] (PS3) Btullis: Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:41:54] (CR) Brouberol: [C: +2] spark-history: add availability monitor [alerts] - https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717) (owner: Brouberol)
[09:42:13] (CR) Marostegui: [C: +1] Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[09:44:57] (PS4) Btullis: Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:45:20] btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/987117
[09:46:08] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/990/con" [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[09:46:10] marostegui: I just added the same to https://gerrit.wikimedia.org/r/c/operations/puppet/+/987116 based on the pcc output.
[09:46:28] Sorry for duplicating
[09:46:33] (CR) Marostegui: [C: -1] "Please disable notifications for now" [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[09:46:46] (Abandoned) Marostegui: dbstore1008: Add sections [puppet] - https://gerrit.wikimedia.org/r/987117 (https://phabricator.wikimedia.org/T351921) (owner: Marostegui)
[09:47:21] (PS5) Btullis: Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:47:30] SRE, Discovery-Search, serviceops, Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[09:47:53] (CR) Marostegui: [C: +1] Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[09:48:39] SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[09:49:33] SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) Open→In progress p:Triage→High
[10:01:22] (CR) Muehlenhoff: [C: +2] rsync::quickdatacopy: Add support for creating nftables-compatible firewall [puppet] - https://gerrit.wikimedia.org/r/984615 (owner: Muehlenhoff)
[10:11:34] (CR) Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:11:48] (CR) Slyngshede: [C: +2] Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175 (owner: Slyngshede)
[10:13:01] (Merged) jenkins-bot: Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175 (owner: Slyngshede)
[10:15:20] pfischer: sorry I was asleep, yeah go ahead please (I think you already did)
[10:18:13] SRE, CirrusSearch, Discovery-Search, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:21:04] (CR) Jelto: "The Dockerfile templates for buster and bookworm also use this workaround. Is /usr/share/man/man1 needed on all Debian distributions or is" [docker-images/production-images] - https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: BCornwall)
[10:24:15] (CR) Btullis: [C: +2] Bring dbstore1008 into service [puppet] - https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: Btullis)
[10:25:04] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/991/con" [puppet] - https://gerrit.wikimedia.org/r/984618 (owner: Muehlenhoff)
[10:26:27] (CR) Muehlenhoff: [C: +2] os-reports: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984613 (owner: Muehlenhoff)
[10:26:43] (CR) Volans: [C: +1] "LGTM for the python part" [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:28:57] (PS1) Hashar: gerrit: make LDAP groups visible to users [puppet] - https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069)
[10:36:14] SRE, CirrusSearch, Discovery-Search, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) In progress→Open a:pfischer→None
[10:38:10] !log fetching haproxy 2.6.16 for thirdparty/haproxy26 bullseye-wikimedia (apt.wm.o)
[10:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:51] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) a:brouberol
[10:46:21] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:50:24] SRE, Data-Platform-SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer→None
[10:50:51] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4050.ulsfo.wmnet} and A:cp
[10:50:54] jouncebot: nowandnext
[10:50:54] No deployments scheduled for the next 0 hour(s) and 9 minute(s)
[10:50:54] In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[10:55:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4050.ulsfo.wmnet} and A:cp
[10:56:49] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.me...
[10:58:22] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:58:30] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) We can see the impact on the overall topic size {F41648651}
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[11:01:32] (CR) Jelto: [V: +1 C: +2] gitlab: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984618 (owner: Muehlenhoff)
[11:13:35] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.me...
[11:17:53] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) 22% of the topic segments were compacted and deleted: {F41648664}
[11:18:30] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[11:32:55] (CR) Alexandros Kosiaris: [C: -1] Use shellbox for djvu handling on kubernetes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: Giuseppe Lavagetto)
[11:33:54] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (MatthewVernon)
[11:34:06] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (MatthewVernon) p:Triage→High
[11:46:13] (CR) Slyngshede: [V: +1 C: +2] C:puppetmaster::monitoring Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[11:46:15] (PS3) Muehlenhoff: aptrepo::staging: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984251
[11:55:52] (CR) Muehlenhoff: [C: +2] aptrepo::staging: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984251 (owner: Muehlenhoff)
[12:07:55] (CR) Muehlenhoff: [C: +2] an-web: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984252 (owner: Muehlenhoff)
[12:08:45] (PS2) Muehlenhoff: statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984818
[12:12:09] (PS1) Brouberol: spark-history: Remove stale comment [deployment-charts] - https://gerrit.wikimedia.org/r/987132
[12:14:06] (PS7) Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694)
[12:17:23] (CR) EoghanGaffney: [C: +1] phabricator: Switch rsync service to use firewall::service [puppet] - https://gerrit.wikimedia.org/r/984811 (owner: Muehlenhoff)
[12:17:46] (CR) Btullis: [C: +1] spark-history: Remove stale comment [deployment-charts] - https://gerrit.wikimedia.org/r/987132 (owner: Brouberol)
[12:17:55] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[12:18:02] SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) Open→Resolved The change has been applied an hour ago (at the line). We don't obs...
[12:19:42] (03CR) 10Btullis: [C: 03+1] statistics::rsyncd: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984803 (owner: 10Muehlenhoff) [12:20:59] (03CR) 10Btullis: [C: 03+1] statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984818 (owner: 10Muehlenhoff) [12:29:32] (03CR) 10Brouberol: [C: 03+2] spark-history: Remove stale comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987132 (owner: 10Brouberol) [12:31:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [12:36:42] (03PS1) 10Ladsgroup: Update virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987134 [12:40:43] (03PS1) 10Muehlenhoff: Add a comment which clarifies the purpose of the kadmin rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/987136 [12:41:39] (03CR) 10Jforrester: "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [12:43:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff) [12:51:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [12:58:16] (03PS1) 10ArielGlenn: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1300) [13:12:54] (03CR) 10Slyngshede: [C: 03+1] 
"LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff) [13:13:59] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [13:21:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [13:23:22] (03PS1) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) [13:23:45] (03CR) 10Muehlenhoff: [C: 03+2] kerberos::kdc: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff) [13:24:44] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:25:13] (03CR) 10Marostegui: [C: 04-1] "You don't need to specify the 106 package if they will be reimaged to bookworm." 
[puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:27:33] (03PS2) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) [13:28:08] (03PS3) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) [13:28:24] (03CR) 10Marostegui: [C: 03+1] Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:31:17] (03CR) 10Btullis: [C: 03+2] Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [13:31:41] (03PS1) 10Aklapper: phabricator: Yearly metrics for wikitech-l: Correct strings [puppet] - 10https://gerrit.wikimedia.org/r/987140 [13:38:45] (03PS1) 10Aklapper: phabricator weekly changes email: Exclude listing some WMCS team tags [puppet] - 10https://gerrit.wikimedia.org/r/987141 [13:44:22] (03PS1) 10Aklapper: phabricator weekly changes email: Explain why some queries are listed [puppet] - 10https://gerrit.wikimedia.org/r/987143 [13:54:42] 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) Interesting! Curious, so the reason for using compaction here is just to save space, not... [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:13] i'll sneak something out [14:00:33] (03PS2) 10Urbanecm: cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004) [14:00:35] (03CR) 10Urbanecm: [C: 03+2] cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004) (owner: 10Urbanecm) [14:00:41] (03PS2) 10Urbanecm: csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114) [14:01:17] (03Merged) 10jenkins-bot: cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004) (owner: 10Urbanecm) [14:01:20] (03CR) 10Urbanecm: [C: 03+2] csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114) (owner: 10Urbanecm) [14:02:05] (03Merged) 10jenkins-bot: csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114) (owner: 10Urbanecm) [14:02:46] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]] [14:02:52] T354004: Grant `patrolmarks` to autopatrolled at Czech Wikipedia - https://phabricator.wikimedia.org/T354004 [14:02:52] T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114 [14:04:20] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:46] !log urbanecm@deploy2002 
urbanecm: Continuing with sync [14:08:53] (03CR) 10Muehlenhoff: [C: 03+2] statistics::rsyncd: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984803 (owner: 10Muehlenhoff) [14:15:12] (03CR) 10Muehlenhoff: [C: 03+2] Add a comment which clarifies the purpose of the kadmin rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/987136 (owner: 10Muehlenhoff) [14:16:33] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]] (duration: 13m 46s) [14:16:38] T354004: Grant `patrolmarks` to autopatrolled at Czech Wikipedia - https://phabricator.wikimedia.org/T354004 [14:16:38] T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114 [14:16:45] okay... `ssh: connect to host mw2394.codfw.wmnet port 22` [14:17:44] (03CR) 10Muehlenhoff: [C: 03+2] statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984818 (owner: 10Muehlenhoff) [14:19:12] which...appears to be pooled as a jobrunner, but unavailable? [14:20:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1008.eqiad.wmnet with OS bookworm [14:22:33] (03CR) 10Xcollazo: "It is a bit hard to follow what actually changed since whitespace also changed. Now the file has a mix of spaces and tabs." 
[puppet] - 10https://gerrit.wikimedia.org/r/986181 (owner: 10Ladsgroup) [14:23:03] (03CR) 10Muehlenhoff: [C: 03+2] graphite::production: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984248 (owner: 10Muehlenhoff) [14:26:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1009.eqiad.wmnet with OS bookworm [14:28:12] (03PS2) 10Muehlenhoff: doc: Switch rsync services to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984800 [14:28:47] (03PS3) 10Muehlenhoff: swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516 [14:32:09] <_joe_> !log confctl select 'name=mw2396.codfw.wmnet' set/pooled=inactive [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:34] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage [14:34:55] 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10MoritzMuehlenhoff) [14:35:55] _joe_: there's a typo, the broken server in need to depool is mw2394, not 2396 [14:36:52] <_joe_> duh [14:36:52] (03CR) 10Bking: [C: 03+2] wdqs: graph split hosts don't need categories [puppet] - 10https://gerrit.wikimedia.org/r/984648 (https://phabricator.wikimedia.org/T352878) (owner: 10Ryan Kemper) [14:37:07] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage [14:39:49] (03CR) 10Btullis: [C: 03+2] Retrict access to the spark-history k8s API tokens [puppet] - 10https://gerrit.wikimedia.org/r/984130 
(https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [14:40:00] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage [14:43:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage [14:44:46] !log [urbanecm@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki=csbwiktionary --fix # T354114 [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:50] T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114 [14:51:38] 10SRE, 10Wikimedia-Mailing-lists, 10Performance Issue, 10Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891 (10Reedy) [14:57:07] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1009.eqiad.wmnet with OS bookworm [14:59:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1008.eqiad.wmnet with OS bookworm [15:02:30] (03Abandoned) 10Samtar: wikimedia.org: add fox. 
[dns] - 10https://gerrit.wikimedia.org/r/980935 (https://phabricator.wikimedia.org/T352870) (owner: 10Samtar) [15:02:44] (03Abandoned) 10Samtar: redirects: Add funnel for fox.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/980879 (https://phabricator.wikimedia.org/T352870) (owner: 10Samtar) [15:04:03] (03PS1) 10Muehlenhoff: vtrs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987149 [15:05:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987149 (owner: 10Muehlenhoff) [15:08:19] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) @fnegri yes cpu was replaced [15:10:42] (03PS1) 10Muehlenhoff: aptrepo:migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987150 [15:14:15] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) cpu was replaced by dell on Dec 22. performed cpu self test multiple times with no errors, Also tech did swap cpu1 and cpu2 locations. 
[15:14:27] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) 05Open→03Resolved [15:15:57] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo:migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987150 (owner: 10Muehlenhoff) [15:19:38] (03PS1) 10Muehlenhoff: lists: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987151 [15:22:55] (03PS1) 10Btullis: Migrate analytics-hive to a new coordinator [dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045) [15:24:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987151 (owner: 10Muehlenhoff) [15:27:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff) [15:27:14] (03CR) 10Muehlenhoff: [C: 03+2] lists: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987151 (owner: 10Muehlenhoff) [15:29:36] (03PS1) 10Muehlenhoff: prometheus::migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987153 [15:31:12] (03PS3) 10Muehlenhoff: failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 [15:31:41] (03CR) 10CI reject: [V: 04-1] failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 (owner: 10Muehlenhoff) [15:32:25] (03CR) 10Hashar: [C: 03+1] "I am not quite sure what is going please deploy whenever it fits :)" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff) [15:32:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987153 (owner: 10Muehlenhoff) [15:34:27] (03CR) 10Brouberol: [C: 03+1] "I checked that an-coord1003 is in service and running debian 11. Looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [15:35:10] (03CR) 10Btullis: [C: 03+2] Migrate analytics-hive to a new coordinator [dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [15:37:58] (03PS1) 10Muehlenhoff: Bump access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/987154 [15:39:22] 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer) @Ottomata, yes, this was intended to a) save disk space and b) reduce the number of record... [15:40:56] 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) Are you sure you want `delete` in the policy then? Perhaps you want to keep all the lates... [15:43:19] (03CR) 10Muehlenhoff: [C: 03+2] Bump access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/987154 (owner: 10Muehlenhoff) [15:43:37] (03CR) 10Samtar: [C: 03+1] Add "patroller" user group to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [15:43:42] (03PS2) 10Samtar: Add "patroller" user group to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae) [15:47:15] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) @MoritzMuehlenhoff would like to swap drive today if your available [15:48:54] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10MoritzMuehlenhoff) >>! 
In T353324#9430279, @Jclark-ctr wrote: > @MoritzMuehlenhoff would like to swap drive today if your available Ack, please go ahead. [16:00:04] eoghan, jelto, and arnoldokoth: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1600) [16:09:11] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:20] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10hashar) That hosts also broke during the MediaWiki train: ` 04:55:49 Started sync_wikiversions 04:55:49 sync_wikiversions: 0% (ok: 0; fail: 0; left: 374) 04:58:04 sudo -u mwdeploy -n --... [16:21:05] (03PS2) 10BCornwall: wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) [16:21:27] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:41] (03CR) 10BCornwall: wmf-debci: Also create man1 dir (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [16:22:44] (03PS1) 10Volans: Use setuptools_scm to set the version [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/987155 [16:25:21] (03CR) 10Giuseppe Lavagetto: Use shellbox for djvu handling on kubernetes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [16:26:17] 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) 05Open→03Resolved Replaced Drive [16:27:05] PROBLEM - MD RAID on ganeti1031 is 
CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:28:09] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:32] (03CR) 10Jelto: [C: 03+1] "lgtm, but I also hope we can remove this java-specific workaround at some point once it's fixed upstream." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [16:31:39] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [16:34:17] (03PS1) 10BBlack: Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) [16:34:31] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354143 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 182576745 was successfully submitted. 
[16:34:57] (03CR) 10CI reject: [V: 04-1] Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack) [16:37:45] (03PS1) 10Btullis: Bring dbstore1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) [16:37:53] (03PS1) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 [16:38:09] (03PS2) 10BBlack: Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) [16:39:20] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/996/console" [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [16:40:56] (03PS2) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 [16:41:14] (03PS2) 10Btullis: Bring dbstore1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) [16:42:48] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/997/con" [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis) [16:44:51] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:58] (03PS1) 10Phuedx: Add agent.app_install_id to android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987159 (https://phabricator.wikimedia.org/T353680) [16:52:10] (03PS3) 10BBlack: Make cuminx002 warning more-visible [puppet] - 
10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) [16:52:36] 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer) @Ottomata, we considered this but but decided against it since a) page_rerender is only o... [16:53:03] (03CR) 10BBlack: "PS3 works as intended now (manually verified with output from compiler on the host)" [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack) [17:00:04] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1700). nyaa~ [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:06:35] 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Eevans) [17:06:40] (03PS1) 10Peter Fischer: Search update pipeline: enable kafka partition discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064) [17:08:28] (03CR) 10Peter Fischer: "Once this has been deployed, we can increment the actual number of partitions." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064) (owner: 10Peter Fischer) [17:12:07] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:13:54] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:17:07] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:26:42] !log dancy@deploy2002 Installing scap version "4.65.1" for 567 hosts [17:28:50] !log dancy@deploy2002 Installing scap version "4.65.1" for 566 hosts [17:29:48] !log dancy@deploy2002 Installation of scap version "4.65.1" completed for 566 hosts [17:59:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1800) [18:01:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/987155 (owner: 10Volans) [18:08:16] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/987149/1000/vrts1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/987149 (owner: 10Muehlenhoff) [18:08:54] (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync services to use 
firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff) [18:11:23] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [18:11:42] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Jclark-ctr) Replaced Failed Drive [18:18:48] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) [18:18:59] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) p:05Triage→03Unbreak! [18:19:51] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Jclark-ctr) 05Open→03Resolved [18:19:53] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) This is a blocker until the host is removed from the dsh targets. [18:19:58] (03CR) 10Dzahn: [C: 03+2] "this made some changes to files in /etc/ferm/conf.d such as there are no more seperate files for the IPv6 version of a rule. 
before we res" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff) [18:20:42] (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync services to use firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff) [18:24:22] (MDRAIDNotEnoughDisks) firing: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks [18:26:39] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.65.2.143 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:59] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100% [18:27:13] 10SRE, 10ops-eqiad, 10observability: InterfaceSpeedError - https://phabricator.wikimedia.org/T351862 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Cable [18:29:18] !log confctl select 'name=mw2394.codfw.wmnet' set/pooled=inactive | T354193#9430654 - seems like 2396 was previously depooled instead of this 2394 [18:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:22] T354193: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 [18:30:49] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) depooled 2394 - per https://sal.toolforge.org/log/vbyWyowBxE1_1c7szGCe previously 2396 was depooled [18:32:40] mutante: i'm a bit confused atm. 
seems 2394 was removed from the mediawiki-installation dsh group but both 2394 and 2396 are still present in scap_targets which is only used for scap installation and upgrades [18:32:56] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:03] dduvall: I dont have any information besides being asked to depool the broken host [18:33:08] at this point [18:33:23] k. i think we're ok for train. i'll deescalate the task and remove as a blocker [18:33:33] thanks for looking into it <3 [18:33:34] I was about to move it from UBN to High, ok? [18:33:44] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:51] I ran puppet on deployment hosts to see if that edits the dsh group [18:33:54] it did not [18:34:10] yes, because that is only for scap deployments then [18:34:20] so agreed, train unblocked [18:35:15] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) Thanks, @Dzahn. After looking a bit more, I don't think the presence in `scap_targets` should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in `scap_t... 
[18:35:27] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) p:05Unbreak!→03Medium [18:35:42] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) [18:35:47] (03CR) 10Gergő Tisza: [C: 03+1] Update virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987134 (owner: 10Ladsgroup) [18:36:23] sorry for the confusion! [18:36:45] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) p:05Medium→03High I agree the train should be unblocked and lowering it from UBN to High seems correct. Also that scap_targets should only influence scap deployment. edit: well, High or Medium :) [18:36:49] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) p:05High→03Medium [18:37:41] seems like scap_deployment is generated from a puppetdb query about what hosts use mediawiki::scap or scap::target. Doesn't seem to heed repools/depools. [18:39:27] sounds about right [18:50:23] 10SRE, 10conftool: conftool no longer automatically !logs changes - https://phabricator.wikimedia.org/T354209 (10taavi) [18:50:30] 10SRE, 10conftool: conftool no longer automatically !logs changes - https://phabricator.wikimedia.org/T354209 (10taavi) a:03taavi [18:51:56] (03CR) 10Krinkle: "I'm curious what led to this patch? Is it about wanting to be auto-logged in when first visiting foundationwiki from another wiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [18:53:43] (03PS1) 10Majavah: cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) [18:57:04] (03CR) 10CI reject: [V: 04-1] cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: 10Majavah) [18:57:11] (03PS2) 10Majavah: cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) [18:57:57] (03PS1) 10Majavah: tox: show black diff on failure [software/conftool] - 10https://gerrit.wikimedia.org/r/987170 [19:00:05] dduvall and dancy: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1900) [19:01:31] o/ [19:05:22] dancy: o/ [19:05:43] Jdlrobson: thoughts about rolling train with https://phabricator.wikimedia.org/T353850 outstanding? 
[19:30:46] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088) [19:30:48] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [19:31:46] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot) [19:38:58] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.12 refs T350088 [19:39:07] T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088 [19:53:42] (SystemdUnitFailed) firing: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:42] (SystemdUnitFailed) resolved: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:50] (03CR) 10BBlack: [C: 03+2] Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack) [20:06:06] (03CR) 10Dzahn: [C: 03+2] phabricator: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984811 (owner: 10Muehlenhoff) [20:14:22] (MDRAIDNotEnoughDisks) resolved: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks [20:14:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 8h 16m 2s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [20:18:00] (03CR) 10Bartosz Dziewoński: [C: 03+1] "Feel free to schedule the change for one of the available backport windows: https://wikitech.wikimedia.org/wiki/Deployments (or ask someon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster) [20:24:03] 10SRE, 10ops-codfw: Inbound interface errors - ge-6/0/22 - db2099 - https://phabricator.wikimedia.org/T354155 (10Dzahn) [20:27:07] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:28:54] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:34] (CirrusSearchJobQueueLagTooHigh) firing: (2) CirrusSearch job cirrusSearchElasticaWrite lag is too high: 9h 30m 31s - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [20:29:59] (03PS1) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) [20:30:28] (03CR) 10CI reject: 
[V: 04-1] disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) (owner: 10Andrew Bogott) [20:30:38] (03CR) 10Dzahn: [C: 03+2] "note these are rules that allow syncing but there is no automatic syncing set up. I manually run a sync of the home dirs from /srv/homes i" [puppet] - 10https://gerrit.wikimedia.org/r/984811 (owner: 10Muehlenhoff) [20:31:48] (03PS2) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) [20:32:28] !log phab2002 - synced /srv/homes from phab1004 to /srv/homes on phab2002 [20:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:41] (03PS1) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 [20:33:29] (03PS2) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 (https://phabricator.wikimedia.org/T353878) [20:34:39] (03CR) 10Ryan Kemper: [C: 03+2] "Forgot to publish comments" [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [20:36:00] (03PS3) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 (https://phabricator.wikimedia.org/T353878) [20:37:05] (03PS1) 10Ryan Kemper: elastic: test out elastic2087 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/987189 [20:37:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:39:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:44:34] (CirrusSearchJobQueueLagTooHigh) firing: (2) CirrusSearch job cirrusSearchElasticaWrite lag is
too high: 6h 55m 33s - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [20:49:34] (CirrusSearchJobQueueLagTooHigh) resolved: (2) CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 55m 33s - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [20:52:00] !log mwmaint2002: `mwscript extensions/GrowthExperiments/maintenance/reassignMentees.php --wiki=enwiki --mentor 'FormalDude' --performer 'Martin Urbanec (WMF)'` (T354220) [20:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:04] T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned - https://phabricator.wikimedia.org/T354220 [20:58:51] (03CR) 10Gergő Tisza: add foundationwiki to the list of central auth login wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:02:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/983955/1004/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [21:06:02] (03CR) 10Dzahn: [C: 03+2] "noop confirmed on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [21:08:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2087.codfw.wmnet with OS bullseye [21:09:33] (03CR) 10Dzahn: [C: 03+2] "among the things this still did was to create the vcs systemuser, change permissions for scripts under /srv/phab/phabricator/scripts/ssh/ " [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [21:17:36] (03CR) 10Bking: [C: 03+1] elastic: test out elastic2087 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/987189 (owner: 10Ryan Kemper) [21:22:40] 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) +1 k! 
[21:28:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:33:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:58:02] (03CR) 10Andrew Bogott: [C: 03+2] designate nova_fixed_multi: create A recs using project_id and project_name [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [22:25:57] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2087.codfw.wmnet with OS bullseye [22:34:20] (03CR) 10Bartosz Dziewoński: "Can you also share your plans for deployment and testing? I am not entirely sure what needs to be done." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [22:36:18] 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF) [22:37:15] 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF) [22:41:18] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10Urbanecm_WMF) I think the k8s migration work as part of this ticket caused {T354229}. [22:42:38] !log mwmaint2002: Restart `mwscript extensions/GrowthExperiments/maintenance/reassignMentees.php --wiki=enwiki --mentor 'FormalDude' --performer 'Martin Urbanec (WMF)'` (T354220) [22:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:41] T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned - https://phabricator.wikimedia.org/T354220 [22:59:48] 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF) [23:17:39] (03CR) 10Gergő Tisza: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [23:24:27] PROBLEM - WDQS SPARQL on wdqs1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:30:42] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (10Tgr) [23:32:05] (03PS1) 10Ebernhardson: 
team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 [23:34:20] (03PS2) 10Ebernhardson: team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 [23:50:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:51:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down