[00:27:06] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:31:38] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[00:32:06] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:51] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/986828
[00:38:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/986828 (owner: 10TrainBranchBot)
[00:42:03] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:27] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:25] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/986828 (owner: 10TrainBranchBot)
[01:03:51] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T354155 (10phaultfinder)
[01:40:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:44] <wikibugs>	 10SRE, 10Commons, 10Traffic-Icebox, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (10Pols12) Wikidata does not use `uselang` hack: this task is specific to Commons. See T58464 for a global task (al...
[02:37:06] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:45] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[03:07:22] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088)
[03:07:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[03:08:54] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:37] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.12 [core] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/986829 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[04:01:58] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088)
[04:02:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[04:03:26] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986869 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[04:03:51] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.12  refs T350088
[04:04:01] <stashbot>	 T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088
[04:19:30] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[04:31:38] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:47:55] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:40] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.12  refs T350088 (duration: 56m 48s)
[05:00:47] <stashbot>	 T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088
[05:11:03] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Performance Issue: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891 (10Ladsgroup) I see the problem in two areas only:  - Opening the main page  - Opening the page for a mailing list with a lot of members.  For the first one, it...
[05:13:57] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:14:25] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/200: 2.21186413264781s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:28:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver POST/200: 2.21186413264781s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[05:40:33] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:57:39] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format dbstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/986871
[06:01:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] installserver: Do not format dbstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/986871 (owner: 10Marostegui)
[06:06:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2144.codfw.wmnet with OS bookworm
[06:12:18] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) @ABran-WMF @MoritzMuehlenhoff we are going to have to give this more priority, dbstore1003 (s1) is now failing in orchestrator as be...
[06:13:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 15m 4s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:18:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 9m 41s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:24:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2144.codfw.wmnet with reason: host reimage
[06:27:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2144.codfw.wmnet with reason: host reimage
[06:45:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2144.codfw.wmnet with OS bookworm
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T0800). nyaa~
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:12:17] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:13:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:16:37] <wikibugs>	 (03PS1) 10Peter Fischer: configure message_key_fields for update_pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987028
[08:17:17] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+2] configure message_key_fields for update_pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987028 (owner: 10Peter Fischer)
[08:18:00] <wikibugs>	 (03Merged) 10jenkins-bot: configure message_key_fields for update_pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987028 (owner: 10Peter Fischer)
[08:18:11] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51307 bytes in 6.414 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:19:45] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:20:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:31] <wikibugs>	 (03CR) 10Sohom Datta: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) (owner: 10Houseblaster)
[08:26:04] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.provision for host mw2448.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:26:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:27:23] <jayme>	 !log restart prometheus@k8s prometheus@k8s-aux in eqiad - T343529
[08:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:27] <stashbot>	 T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[08:27:49] <pfischer>	 Amir1: Hi! I accidentally +2ed a minor config change that was intended to be deployed as part of a backport window. (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/987028) Does that get deployed now or do I still have to request a deployment explicitly?
[08:28:35] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:30:24] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:31:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:31:38] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:33:41] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2448.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:34:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987031 (https://phabricator.wikimedia.org/T352569)
[08:34:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Use shellbox for djvu handling on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515)
[08:34:30] <jinxer-wm>	 (KubernetesAPINotScrapable) resolved: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:34:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Always process media files via shellbox on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987033 (https://phabricator.wikimedia.org/T352515)
[08:35:21] <_joe_>	 jouncebot: nowandnext
[08:35:21] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T0800)
[08:35:21] <jouncebot>	 In 2 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[08:35:39] <_joe_>	 uhm I guess no one's around
[08:39:31] <wikibugs>	 (03PS2) 10Brouberol: spark-history: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717)
[08:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:41:43] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (10akosiaris) The machine isn't pooled yet into traffic. There is an alert for frequent changes due to puppet run. Indeed the following happens at every puppet run  `Notice: /Stage[main]/Cpufrequtils/Exec...
[08:43:39] <wikibugs>	 (03CR) 10Brouberol: "It _seems_ that dual stack for Services without selectors is supported and does what we want by default: https://kubernetes.io/docs/concep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[09:00:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10MoritzMuehlenhoff) I've corrected the group membership; contractors should be in cn=wmf, the cn=nda LDAP group is for community members w...
[09:02:22] <logmsgbot>	 !log pfischer@deploy2002 Started scap: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]]
[09:02:54] <moritzm>	 !log installing nodejs security updates on bookworm
[09:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:09] <logmsgbot>	 !log pfischer@deploy2002 pfischer: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:05:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:05:53] <logmsgbot>	 !log pfischer@deploy2002 pfischer: Continuing with sync
[09:06:08] <taavi>	 pfischer: if you accidentally +2 a mw-config patch, you need to either immediately deploy it or immediately revert and pull the reverting commit to deploy2002
[09:06:37] <pfischer>	 taavi: Thanks! Deployment is already in progress.
[09:09:54] <wikibugs>	 (CR) Urbanecm: [C: +1] [namespaces] Use correct diacritics in Romanian (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[09:10:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:10:39] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:17:57] <logmsgbot>	 !log pfischer@deploy2002 Finished scap: Backport for [[gerrit:987028|configure message_key_fields for update_pipeline]] (duration: 15m 35s)
[09:20:24] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:21:00] <wikibugs>	 (03PS1) 10Btullis: Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:21:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[09:22:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[09:23:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Commissioning new database server
[09:23:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Commissioning new database server
[09:23:55] <wikibugs>	 (03PS1) 10Marostegui: dbstore1008: Add sections [puppet] - 10https://gerrit.wikimedia.org/r/987117 (https://phabricator.wikimedia.org/T351921)
[09:24:58] <wikibugs>	 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T353913 (10phaultfinder)
[09:26:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] check_wmf_styleguide: Remove check to enforce presence of system::role [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/983689 (owner: 10Muehlenhoff)
[09:31:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10fnegri) @Jclark-ctr the host was restarted on Dec 22 at 18:29 UTC. Has the CPU been replaced?
[09:35:04] <wikibugs>	 (03PS2) 10Btullis: Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:37:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] spark-history: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717) (owner: 10Brouberol)
[09:41:13] <wikibugs>	 (03PS3) 10Btullis: Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:41:54] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] spark-history: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/984223 (https://phabricator.wikimedia.org/T353717) (owner: 10Brouberol)
[09:42:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[09:44:57] <wikibugs>	 (03PS4) 10Btullis: Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:45:20] <marostegui>	 btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/987117
[09:46:08] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/990/con" [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[09:46:10] <btullis>	 marostegui: I just added the same to https://gerrit.wikimedia.org/r/c/operations/puppet/+/987116 based on the pcc output.
[09:46:28] <btullis>	 Sorry for duplicating
[09:46:33] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "Please disable notifications for now" [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[09:46:46] <wikibugs>	 (03Abandoned) 10Marostegui: dbstore1008: Add sections [puppet] - 10https://gerrit.wikimedia.org/r/987117 (https://phabricator.wikimedia.org/T351921) (owner: 10Marostegui)
[09:47:21] <wikibugs>	 (03PS5) 10Btullis: Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921)
[09:47:30] <wikibugs>	 10SRE, 10Discovery-Search, 10serviceops, 10Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10pfischer)
[09:47:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[09:48:39] <wikibugs>	 10SRE, 10serviceops, 10Discovery-Search (Current work), 10Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10pfischer)
[09:49:33] <wikibugs>	 SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) Open→In progress p:Triage→High
[10:01:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] rsync::quickdatacopy: Add support for creating nftables-compatible firewall [puppet] - 10https://gerrit.wikimedia.org/r/984615 (owner: 10Muehlenhoff)
[10:11:34] <wikibugs>	 (CR) Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:11:48] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Add instance to summary for NTP [alerts] - 10https://gerrit.wikimedia.org/r/981175 (owner: 10Slyngshede)
[10:13:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add instance to summary for NTP [alerts] - 10https://gerrit.wikimedia.org/r/981175 (owner: 10Slyngshede)
[10:15:20] <Amir1>	 pfischer: sorry I was asleep, yeah go ahead please (I think you already did)
[10:18:13] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer)
[10:21:04] <wikibugs>	 (03CR) 10Jelto: "The Dockerfile templates for buster and bookworm also use this workaround. Is /usr/share/man/man1 needed on all Debian distributions or is" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall)
[10:24:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bring dbstore1008 into service [puppet] - 10https://gerrit.wikimedia.org/r/987116 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[10:25:04] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/991/con" [puppet] - 10https://gerrit.wikimedia.org/r/984618 (owner: 10Muehlenhoff)
[10:26:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] os-reports: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984613 (owner: 10Muehlenhoff)
[10:26:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for the python part" [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[10:28:57] <wikibugs>	 (03PS1) 10Hashar: gerrit: make LDAP groups visible to users [puppet] - 10https://gerrit.wikimedia.org/r/987120 (https://phabricator.wikimedia.org/T354069)
[10:36:14] <wikibugs>	 SRE, CirrusSearch, Discovery-Search, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) In progress→Open a:pfischer→None
[10:38:10] <vgutierrez>	 !log fetching haproxy 2.6.16 for thirdparty/haproxy26 bullseye-wikimedia (apt.wm.o)
[10:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:51] <wikibugs>	 SRE, CirrusSearch, Discovery-Search, serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) a:brouberol
[10:46:21] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer)
[10:50:24] <wikibugs>	 SRE, Data-Platform-SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer→None
[10:50:51] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4050.ulsfo.wmnet} and A:cp
[10:50:54] <urbanecm>	 jouncebot: nowandnext
[10:50:54] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 9 minute(s)
[10:50:54] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[10:55:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4050.ulsfo.wmnet} and A:cp
[10:56:49] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.me...
[10:58:22] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer)
[10:58:30] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol) We can see the impact on the overall topic size {F41648651}
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1100)
[11:01:32] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984618 (owner: 10Muehlenhoff)
[11:13:35] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.me...
[11:17:53] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol) 22% of the topic segments were compacted and deleted: {F41648664}
[11:18:30] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol)
[11:32:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Use shellbox for djvu handling on kubernetes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto)
[11:33:54] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (10MatthewVernon)
[11:34:06] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (10MatthewVernon) p:05Triage→03High
[11:46:13] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:puppetmaster::monitoring Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/983713 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[11:46:15] <wikibugs>	 (03PS3) 10Muehlenhoff: aptrepo::staging: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984251
[11:55:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] aptrepo::staging: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984251 (owner: 10Muehlenhoff)
[12:07:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] an-web: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984252 (owner: 10Muehlenhoff)
[12:08:45] <wikibugs>	 (03PS2) 10Muehlenhoff: statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984818
[12:12:09] <wikibugs>	 (03PS1) 10Brouberol: spark-history: Remove stale comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987132
[12:14:06] <wikibugs>	 (03PS7) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694)
[12:17:23] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] phabricator: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984811 (owner: 10Muehlenhoff)
[12:17:46] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] spark-history: Remove stale comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987132 (owner: 10Brouberol)
[12:17:55] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol)
[12:18:02] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10brouberol) 05Open→03Resolved The change has been applied an hour ago (at the line). We don't obs...
[12:19:42] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] statistics::rsyncd: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984803 (owner: 10Muehlenhoff)
[12:20:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984818 (owner: 10Muehlenhoff)
[12:29:32] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] spark-history: Remove stale comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/987132 (owner: 10Brouberol)
[12:31:38] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[12:36:42] <wikibugs>	 (03PS1) 10Ladsgroup: Update virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987134
[12:40:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a comment which clarifies the purpose of the kadmin rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/987136
[12:41:39] <wikibugs>	 (03CR) 10Jforrester: "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup)
[12:43:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff)
[12:51:06] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[12:58:16] <wikibugs>	 (03PS1) 10ArielGlenn: add foundationwiki to the list of central auth login wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1300)
[13:12:54] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff)
[13:13:59] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn)
[13:21:06] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[13:23:22] <wikibugs>	 (03PS1) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921)
[13:23:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] kerberos::kdc: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984617 (owner: 10Muehlenhoff)
[13:24:44] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[13:25:13] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "You don't need to specify the 106 package if they will be reimaged to bookworm." [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[13:27:33] <wikibugs>	 (03PS2) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921)
[13:28:08] <wikibugs>	 (03PS3) 10Btullis: Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921)
[13:28:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[13:31:17] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Upgrade dbstore100[89] to mariadb 10.6 with reimage [puppet] - 10https://gerrit.wikimedia.org/r/987139 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis)
[13:31:41] <wikibugs>	 (03PS1) 10Aklapper: phabricator: Yearly metrics for wikitech-l: Correct strings [puppet] - 10https://gerrit.wikimedia.org/r/987140
[13:38:45] <wikibugs>	 (03PS1) 10Aklapper: phabricator weekly changes email: Exclude listing some WMCS team tags [puppet] - 10https://gerrit.wikimedia.org/r/987141
[13:44:22] <wikibugs>	 (03PS1) 10Aklapper: phabricator weekly changes email: Explain why some queries are listed [puppet] - 10https://gerrit.wikimedia.org/r/987143
[13:54:42] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) Interesting!  Curious, so the reason for using compaction here is just to save space, not...
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:13] <urbanecm>	 i'll sneak something out
[14:00:33] <wikibugs>	 (03PS2) 10Urbanecm: cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004)
[14:00:35] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004) (owner: 10Urbanecm)
[14:00:41] <wikibugs>	 (03PS2) 10Urbanecm: csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114)
[14:01:17] <wikibugs>	 (03Merged) 10jenkins-bot: cswiki: Grant patrolmarks to autopatrolled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985384 (https://phabricator.wikimedia.org/T354004) (owner: 10Urbanecm)
[14:01:20] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114) (owner: 10Urbanecm)
[14:02:05] <wikibugs>	 (03Merged) 10jenkins-bot: csbwiktionary: Set MetaNamespaceName to Wikisłowôrz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986640 (https://phabricator.wikimedia.org/T354114) (owner: 10Urbanecm)
[14:02:46] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]]
[14:02:52] <stashbot>	 T354004: Grant `patrolmarks` to autopatrolled at Czech Wikipedia - https://phabricator.wikimedia.org/T354004
[14:02:52] <stashbot>	 T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114
[14:04:20] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:04:46] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[14:08:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] statistics::rsyncd: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984803 (owner: 10Muehlenhoff)
[14:15:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add a comment which clarifies the purpose of the kadmin rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/987136 (owner: 10Muehlenhoff)
[14:16:33] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:985384|cswiki: Grant patrolmarks to autopatrolled (T354004)]], [[gerrit:986640|csbwiktionary: Set MetaNamespaceName to Wikisłowôrz (T354114)]] (duration: 13m 46s)
[14:16:38] <stashbot>	 T354004: Grant `patrolmarks` to autopatrolled at Czech Wikipedia - https://phabricator.wikimedia.org/T354004
[14:16:38] <stashbot>	 T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114
[14:16:45] <urbanecm>	 okay... `ssh: connect to host mw2394.codfw.wmnet port 22`
[14:17:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] statistics::explorer::misc_jobs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984818 (owner: 10Muehlenhoff)
[14:19:12] <urbanecm>	 which...appears to be pooled as a jobrunner, but unavailable?
[14:20:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1008.eqiad.wmnet with OS bookworm
[14:22:33] <wikibugs>	 (03CR) 10Xcollazo: "It is a bit hard to follow what actually changed since whitespace also changed. Now the file has a mix of spaces and tabs." [puppet] - 10https://gerrit.wikimedia.org/r/986181 (owner: 10Ladsgroup)
[14:23:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] graphite::production: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984248 (owner: 10Muehlenhoff)
[14:26:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1009.eqiad.wmnet with OS bookworm
[14:28:12] <wikibugs>	 (03PS2) 10Muehlenhoff: doc: Switch rsync services to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984800
[14:28:47] <wikibugs>	 (03PS3) 10Muehlenhoff: swift: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984516
[14:32:09] <_joe_>	 !log confctl select 'name=mw2396.codfw.wmnet' set/pooled=inactive
[14:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:34] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage
[14:34:55] <wikibugs>	 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10MoritzMuehlenhoff)
[14:35:55] <moritzm>	 _joe_: there's a typo, the broken server that needs to be depooled is mw2394, not 2396
[14:36:52] <_joe_>	 duh
[14:36:52] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: graph split hosts don't need categories [puppet] - 10https://gerrit.wikimedia.org/r/984648 (https://phabricator.wikimedia.org/T352878) (owner: 10Ryan Kemper)
[14:37:07] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage
[14:39:49] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Retrict access to the spark-history k8s API tokens [puppet] - 10https://gerrit.wikimedia.org/r/984130 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis)
[14:40:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage
[14:43:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage
[14:44:46] <urbanecm>	 !log [urbanecm@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki=csbwiktionary --fix # T354114
[14:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:50] <stashbot>	 T354114: Localised name for csb wiktionary - https://phabricator.wikimedia.org/T354114
[14:51:38] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Performance Issue, 10Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891 (10Reedy)
[14:57:07] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1009.eqiad.wmnet with OS bookworm
[14:59:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1008.eqiad.wmnet with OS bookworm
[15:02:30] <wikibugs>	 (03Abandoned) 10Samtar: wikimedia.org: add fox. [dns] - 10https://gerrit.wikimedia.org/r/980935 (https://phabricator.wikimedia.org/T352870) (owner: 10Samtar)
[15:02:44] <wikibugs>	 (03Abandoned) 10Samtar: redirects: Add funnel for fox.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/980879 (https://phabricator.wikimedia.org/T352870) (owner: 10Samtar)
[15:04:03] <wikibugs>	 (03PS1) 10Muehlenhoff: vtrs: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987149
[15:05:24] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987149 (owner: 10Muehlenhoff)
[15:08:19] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) @fnegri  yes cpu was replaced
[15:10:42] <wikibugs>	 (03PS1) 10Muehlenhoff: aptrepo:migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987150
[15:14:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) cpu was replaced by dell on Dec 22. performed cpu self test multiple times with no errors,  Also tech did swap cpu1  and cpu2 locations.
[15:14:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) 05Open→03Resolved
[15:15:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] aptrepo:migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987150 (owner: 10Muehlenhoff)
[15:19:38] <wikibugs>	 (03PS1) 10Muehlenhoff: lists: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987151
[15:22:55] <wikibugs>	 (03PS1) 10Btullis: Migrate analytics-hive to a new coordinator [dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045)
[15:24:39] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987151 (owner: 10Muehlenhoff)
[15:27:09] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff)
[15:27:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] lists: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987151 (owner: 10Muehlenhoff)
[15:29:36] <wikibugs>	 (03PS1) 10Muehlenhoff: prometheus::migration: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987153
[15:31:12] <wikibugs>	 (03PS3) 10Muehlenhoff: failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687
[15:31:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] failoid: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/983687 (owner: 10Muehlenhoff)
[15:32:25] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I am not quite sure what is going please deploy whenever it fits :)" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff)
[15:32:43] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987153 (owner: 10Muehlenhoff)
[15:34:27] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "I checked that an-coord1003 is in service and running debian 11. Looks good!" [dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis)
[15:35:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Migrate analytics-hive to a new coordinator [dns] - 10https://gerrit.wikimedia.org/r/987152 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis)
[15:37:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/987154
[15:39:22] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer) @Ottomata, yes, this was intended to a) save disk space and b) reduce the number of record...
[15:40:56] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) Are you sure you want `delete` in the policy then?  Perhaps you want to keep all the lates...
[15:43:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Bump access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/987154 (owner: 10Muehlenhoff)
[15:43:37] <wikibugs>	 (03CR) 10Samtar: [C: 03+1] Add "patroller" user group to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae)
[15:43:42] <wikibugs>	 (03PS2) 10Samtar: Add "patroller" user group to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/986200 (https://phabricator.wikimedia.org/T354063) (owner: 10Novem Linguae)
[15:47:15] <wikibugs>	 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) @MoritzMuehlenhoff  would like to swap drive today if your available
[15:48:54] <wikibugs>	 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10MoritzMuehlenhoff) >>! In T353324#9430279, @Jclark-ctr wrote: > @MoritzMuehlenhoff  would like to swap drive today if your available   Ack, please go ahead.
[16:00:04] <jouncebot>	 eoghan, jelto, and arnoldokoth: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1600)
[16:09:11] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:20] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10hashar) That hosts also broke during the MediaWiki train: ` 04:55:49 Started sync_wikiversions 04:55:49 sync_wikiversions:   0% (ok: 0; fail: 0; left: 374)                     04:58:04 sudo -u mwdeploy -n --...
[16:21:05] <wikibugs>	 (03PS2) 10BCornwall: wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003)
[16:21:27] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:41] <wikibugs>	 (03CR) 10BCornwall: wmf-debci: Also create man1 dir (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall)
[16:22:44] <wikibugs>	 (03PS1) 10Volans: Use setuptools_scm to set the version [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/987155
[16:25:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Use shellbox for djvu handling on kubernetes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987032 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto)
[16:26:17] <wikibugs>	 10SRE, 10ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (10Jclark-ctr) 05Open→03Resolved Replaced Drive
[16:27:05] <icinga-wm>	 PROBLEM - MD RAID on ganeti1031 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:28:09] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:32] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, but I also hope we can remove this java-specific workaround at some point once it's fixed upstream." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984297 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall)
[16:31:39] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[16:34:17] <wikibugs>	 (03PS1) 10BBlack: Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419)
[16:34:31] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354143 (10Jclark-ctr) a:03Jclark-ctr Confirmed: Service Request 182576745 was successfully submitted.
[16:34:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack)
[16:37:45] <wikibugs>	 (03PS1) 10Btullis: Bring dbstore1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924)
[16:37:53] <wikibugs>	 (03PS1) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158
[16:38:09] <wikibugs>	 (03PS2) 10BBlack: Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419)
[16:39:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/996/console" [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis)
[16:40:56] <wikibugs>	 (03PS2) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158
[16:41:14] <wikibugs>	 (03PS2) 10Btullis: Bring dbstore1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924)
[16:42:48] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/997/con" [puppet] - 10https://gerrit.wikimedia.org/r/987157 (https://phabricator.wikimedia.org/T351924) (owner: 10Btullis)
[16:44:51] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:58] <wikibugs>	 (03PS1) 10Phuedx: Add agent.app_install_id to android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987159 (https://phabricator.wikimedia.org/T353680)
[16:52:10] <wikibugs>	 (03PS3) 10BBlack: Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419)
[16:52:36] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10pfischer) @Ottomata, we considered this but decided against it since  a) page_rerender is only o...
[16:53:03] <wikibugs>	 (03CR) 10BBlack: "PS3 works as intended now (manually verified with output from compiler on the host)" [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack)
[17:00:04] <jouncebot>	 jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1700). nyaa~
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:06:35] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Eevans)
[17:06:40] <wikibugs>	 (03PS1) 10Peter Fischer: Search update pipeline: enable kafka partition discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064)
[17:08:28] <wikibugs>	 (03CR) 10Peter Fischer: "Once this has been deployed, we can increment the actual number of partitions." [deployment-charts] - 10https://gerrit.wikimedia.org/r/987160 (https://phabricator.wikimedia.org/T354064) (owner: 10Peter Fischer)
[17:12:07] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:13:54] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:17:07] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:26:42] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.65.1" for 567 hosts
[17:28:50] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.65.1" for 566 hosts
[17:29:48] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.65.1" completed for 566 hosts
[17:59:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1800)
[18:01:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/987155 (owner: 10Volans)
[18:08:16] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/987149/1000/vrts1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/987149 (owner: 10Muehlenhoff)
[18:08:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync services to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff)
[18:11:23] <jinxer-wm>	 (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[18:11:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Jclark-ctr) Replaced  Failed Drive
[18:18:48] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall)
[18:18:59] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) p:05Triage→03Unbreak!
[18:19:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sd3) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T354200 (10Jclark-ctr) 05Open→03Resolved
[18:19:53] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) This is a blocker until the host is removed from the dsh targets.
[18:19:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "this made some changes to files in /etc/ferm/conf.d such as there are no more separate files for the IPv6 version of a rule. before we res" [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff)
[18:20:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: Switch rsync services to use firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984800 (owner: 10Muehlenhoff)
[18:24:22] <jinxer-wm>	 (MDRAIDNotEnoughDisks) firing: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks
[18:26:39] <icinga-wm>	 PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.65.2.143 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:26:59] <icinga-wm>	 PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100%
[18:27:13] <wikibugs>	 10SRE, 10ops-eqiad, 10observability: InterfaceSpeedError - https://phabricator.wikimedia.org/T351862 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Cable
[18:29:18] <mutante>	 !log confctl select 'name=mw2394.codfw.wmnet' set/pooled=inactive | T354193#9430654 - seems like 2396 was previously depooled instead of this 2394
[18:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:22] <stashbot>	 T354193: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193
[18:30:49] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) depooled 2394 - per https://sal.toolforge.org/log/vbyWyowBxE1_1c7szGCe previously 2396 was depooled
[18:32:40] <dduvall>	 mutante: i'm a bit confused atm. seems 2394 was removed from the mediawiki-installation dsh group but both 2394 and 2396 are still present in scap_targets which is only used for scap installation and upgrades
[18:32:56] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:03] <mutante>	 dduvall: I don't have any information besides being asked to depool the broken host
[18:33:08] <mutante>	 at this point
[18:33:23] <dduvall>	 k. i think we're ok for train. i'll deescalate the task and remove as a blocker
[18:33:33] <dduvall>	 thanks for looking into it <3
[18:33:34] <mutante>	 I was about to move it from UBN to High, ok?
[18:33:44] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:51] <mutante>	 I ran puppet on deployment hosts to see if that edits the dsh group
[18:33:54] <mutante>	 it did not
[18:34:10] <mutante>	 yes, because that is only for scap deployments then
[18:34:20] <mutante>	 so agreed, train unblocked
[18:35:15] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) Thanks, @Dzahn. After looking a bit more, I don't think the presence in `scap_targets` should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in `scap_t...
[18:35:27] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall) p:05Unbreak!→03Medium
[18:35:42] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10dduvall)
[18:35:47] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Update virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987134 (owner: 10Ladsgroup)
[18:36:23] <dduvall>	 sorry for the confusion!
[18:36:45] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) p:05Medium→03High I agree the train should be unblocked and lowering it from UBN to High seems correct.  Also that scap_targets should only influence scap deployment.  edit: well, High or Medium :)
[18:36:49] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Dzahn) p:05High→03Medium
[18:37:41] <thcipriani>	 seems like scap_deployment is generated from a puppetdb query about what hosts use mediawiki::scap or scap::target. Doesn't seem to heed repools/depools.
[18:39:27] <mutante>	 sounds about right
[18:50:23] <wikibugs>	 10SRE, 10conftool: conftool no longer automatically !logs changes - https://phabricator.wikimedia.org/T354209 (10taavi)
[18:50:30] <wikibugs>	 10SRE, 10conftool: conftool no longer automatically !logs changes - https://phabricator.wikimedia.org/T354209 (10taavi) a:03taavi
[18:51:56] <wikibugs>	 (03CR) 10Krinkle: "I'm curious what led to this patch? Is it about wanting to be auto-logged in when first visiting foundationwiki from another wiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn)
[18:53:43] <wikibugs>	 (03PS1) 10Majavah: cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209)
[18:57:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209) (owner: 10Majavah)
[18:57:11] <wikibugs>	 (03PS2) 10Majavah: cli: Fix IRC logging [software/conftool] - 10https://gerrit.wikimedia.org/r/987167 (https://phabricator.wikimedia.org/T354209)
[18:57:57] <wikibugs>	 (03PS1) 10Majavah: tox: show black diff on failure [software/conftool] - 10https://gerrit.wikimedia.org/r/987170
[19:00:05] <jouncebot>	 dduvall and dancy: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T1900)
[19:01:31] <dancy>	 o/
[19:05:22] <dduvall>	 dancy: o/
[19:05:43] <dduvall>	 Jdlrobson: thoughts about rolling train with https://phabricator.wikimedia.org/T353850 outstanding?
[19:30:46] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088)
[19:30:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[19:31:46] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987175 (https://phabricator.wikimedia.org/T350088) (owner: 10TrainBranchBot)
[19:38:58] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.12  refs T350088
[19:39:07] <stashbot>	 T350088: 1.42.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T350088
[19:53:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:50] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Make cuminx002 warning more-visible [puppet] - 10https://gerrit.wikimedia.org/r/987156 (https://phabricator.wikimedia.org/T353419) (owner: 10BBlack)
[20:06:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984811 (owner: 10Muehlenhoff)
[20:14:22] <jinxer-wm>	 (MDRAIDNotEnoughDisks) resolved: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks
[20:14:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 8h 16m 2s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[20:18:00] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] "Feel free to schedule the change for one of the available backport windows: https://wikitech.wikimedia.org/wiki/Deployments (or ask someon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985647 (https://phabricator.wikimedia.org/T354013) (owner: 10Houseblaster)
[20:24:03] <wikibugs>	 10SRE, 10ops-codfw: Inbound interface errors - ge-6/0/22 - db2099 - https://phabricator.wikimedia.org/T354155 (10Dzahn)
[20:27:07] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:28:54] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:29:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: (2) CirrusSearch job cirrusSearchElasticaWrite lag is too high: 9h 30m 31s - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[20:29:59] <wikibugs>	 (03PS1) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642)
[20:30:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) (owner: 10Andrew Bogott)
[20:30:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "note these are rules that allow syncing but there is no automatic syncing set up. I manually run a sync of the home dirs from /srv/homes i" [puppet] - 10https://gerrit.wikimedia.org/r/984811 (owner: 10Muehlenhoff)
[20:31:48] <wikibugs>	 (03PS2) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642)
[20:32:28] <mutante>	 !log phab2002 - synced /srv/homes from phab1004 to /srv/homes on phab2002
[20:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:41] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188
[20:33:29] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 (https://phabricator.wikimedia.org/T353878)
[20:34:39] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] "Forgot to publish comments" [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper)
[20:36:00] <wikibugs>	 (03PS3) 10Ryan Kemper: elastic: prepare new hosts [puppet] - 10https://gerrit.wikimedia.org/r/987188 (https://phabricator.wikimedia.org/T353878)
[20:37:05] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: test out elastic2087 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/987189
[20:37:55] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:39:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:44:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: (2) CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 55m 33s - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[20:49:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: (2) CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 55m 33s - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[20:52:00] <urbanecm>	 !log mwmaint2002: `mwscript extensions/GrowthExperiments/maintenance/reassignMentees.php --wiki=enwiki --mentor 'FormalDude' --performer 'Martin Urbanec (WMF)'` (T354220)
[20:52:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:04] <stashbot>	 T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned - https://phabricator.wikimedia.org/T354220
[20:58:51] <wikibugs>	 (03CR) 10Gergő Tisza: add foundationwiki to the list of central auth login wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987138 (https://phabricator.wikimedia.org/T205347) (owner: 10ArielGlenn)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240102T2100).
[21:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:02:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/983955/1004/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn)
[21:06:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop confirmed on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn)
[21:08:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2087.codfw.wmnet with OS bullseye
[21:09:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "among the things this still did was to create the vcs systemuser, change permissions for scripts under /srv/phab/phabricator/scripts/ssh/ " [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn)
[21:17:36] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: test out elastic2087 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/987189 (owner: 10Ryan Kemper)
[21:22:40] <wikibugs>	 10SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (10Ottomata) +1 k!
[21:28:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:33:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:58:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] designate nova_fixed_multi: create A recs using project_id and project_name [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott)
[22:25:57] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:28] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2087.codfw.wmnet with OS bullseye
[22:34:20] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Can you also share your plans for deployment and testing? I am not entirely sure what needs to be done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[22:36:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF)
[22:37:15] <wikibugs>	 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF)
[22:41:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10Urbanecm_WMF) I think the k8s migration work as part of this ticket caused {T354229}.
[22:42:38] <urbanecm>	 !log mwmaint2002: Restart `mwscript extensions/GrowthExperiments/maintenance/reassignMentees.php --wiki=enwiki --mentor 'FormalDude' --performer 'Martin Urbanec (WMF)'` (T354220)
[22:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:41] <stashbot>	 T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned - https://phabricator.wikimedia.org/T354220
[22:59:48] <wikibugs>	 10SRE, 10MW-on-K8s, 10WMF-JobQueue, 10serviceops: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (10Urbanecm_WMF)
[23:17:39] <wikibugs>	 (03CR) 10Gergő Tisza: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[23:24:27] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:30:42] <wikibugs>	 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (10Tgr)
[23:32:05] <wikibugs>	 (03PS1) 10Ebernhardson: team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206
[23:34:20] <wikibugs>	 (03PS2) 10Ebernhardson: team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206
[23:50:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:51:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down