[00:18:33] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:27:12] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:29:03] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:29:15] (PS3) EpicPupper: planet: add various feeds, reorganize [puppet] - https://gerrit.wikimedia.org/r/988001
[00:30:32] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:30:33] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354499 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:30:37] SRE, ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (ops-monitoring-bot)
[00:32:12] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:22] (MDRAIDFailedDisk) firing: MD RAID -
Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[00:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/988227
[00:38:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/988227 (owner: TrainBranchBot)
[00:56:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/988227 (owner: TrainBranchBot)
[01:25:27] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[01:39:17] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:00:22] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:12] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:03] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:04:20]
PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:33] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:38:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:39:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:00:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:10:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:10:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:10:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:25:27] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[05:39:17] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:41:16] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:26] (CR) Santhosh: [C: +2] Fix Special:ExternalGuidance [extensions/ExternalGuidance] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987994 (https://phabricator.wikimedia.org/T354404) (owner: Jdlrobson)
[06:04:31] (Merged) jenkins-bot: Fix Special:ExternalGuidance [extensions/ExternalGuidance] (wmf/1.42.0-wmf.12) - https://gerrit.wikimedia.org/r/987994 (https://phabricator.wikimedia.org/T354404) (owner: Jdlrobson)
[06:25:10] (PS1) Marostegui: installserver: Do not reimage db1246 [puppet] - https://gerrit.wikimedia.org/r/988321
[06:31:01] (CR) Marostegui: [C: +2] installserver: Do not reimage db1246 [puppet] - https://gerrit.wikimedia.org/r/988321 (owner: Marostegui)
[06:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads -
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:35:01] (PS1) Marostegui: production-m5.sql.erb: Remove DROP, INDEX, ALTER [puppet] - https://gerrit.wikimedia.org/r/988336 (https://phabricator.wikimedia.org/T351189)
[06:39:30] (CR) Marostegui: [C: +2] production-m5.sql.erb: Remove DROP, INDEX, ALTER [puppet] - https://gerrit.wikimedia.org/r/988336 (https://phabricator.wikimedia.org/T351189) (owner: Marostegui)
[07:01:40] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[07:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:51:39] (CR) D3r1ck01: wmf-config: Remove unused wgStatsCacheType setting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[07:59:34] (PS4) Muehlenhoff: aptrepo: add Elastic-related components to bookworm repo [puppet] - https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: Bking)
[08:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T0800)
[08:00:05] xSavitar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:26] o/
[08:02:08] I can deploy if no deployer is around at this time.
[08:02:39] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1037/co" [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:03:29] (CR) Slyngshede: [V: +1 C: +2] C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:04:59] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: Bking)
[08:05:34] (CR) Muehlenhoff: clamav: add systemd override, enable restart on-failure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988081 (owner: Dzahn)
[08:09:36] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:18] (PS1) Slyngshede: C:puppetmaster::monitoring disable timers [puppet] - https://gerrit.wikimedia.org/r/988354 (https://phabricator.wikimedia.org/T350694)
[08:10:48] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppetmerge_labs_private.service,prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:00] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:13] The failing service on the puppetmaster is me, I'm just disabling those services and fixing the script they run
[08:12:38] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units
failed: prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:41] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1038/co" [puppet] - https://gerrit.wikimedia.org/r/988354 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:12:47] (CR) Slyngshede: [V: +1 C: +2] C:puppetmaster::monitoring disable timers [puppet] - https://gerrit.wikimedia.org/r/988354 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:12:54] (CR) D3r1ck01: [C: +2] wmf-config: Remove unused wgStatsCacheType setting [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[08:14:10] (Merged) jenkins-bot: wmf-config: Remove unused wgStatsCacheType setting [mediawiki-config] - https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[08:17:13] !log derick@deploy2002 Started scap: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]]
[08:17:17] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004
[08:18:33] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:18:44] !log derick@deploy2002 derick and d3r1ck01: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:19:47] (CR) Muehlenhoff: puppet: add quota module to vendor_modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/987491
(https://phabricator.wikimedia.org/T343364) (owner: Dzahn)
[08:19:59] !log derick@deploy2002 derick and d3r1ck01: Continuing with sync
[08:25:24] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) >>! In T300152#9437438, @ayounsi wrote: > On naming I didn't use `private1-ganeti-codfw` as I didn't want to tie the IPs to a specific tool. On the ot...
[08:26:24] !log derick@deploy2002 Finished scap: Backport for [[gerrit:974508|wmf-config: Remove unused wgStatsCacheType setting (T336004)]] (duration: 09m 11s)
[08:26:28] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004
[08:27:33] !log UTC morning backport window done.
[08:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:47] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9437823, @jbond wrote: >>>! In T352974#9392688, @ABran-WMF wrote: >> it appears that most of our hosts are sti...
[08:28:53] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) >>! In T352974#9440440, @MoritzMuehlenhoff wrote: >>>! In T352974#9437823, @jbond wrote: >>>>! In T352974#9392688, @ABran-WMF wrote:...
[08:29:23] (PS1) Slyngshede: C:puppetmaster::monitoring Reenable Prometheus data collection.
[puppet] - https://gerrit.wikimedia.org/r/988398 (https://phabricator.wikimedia.org/T350694)
[08:32:53] (PS1) Muehlenhoff: Update associated email address for dreamyjazz [puppet] - https://gerrit.wikimedia.org/r/988399
[08:33:50] (PS2) Muehlenhoff: Update associated email address for dreamyjazz [puppet] - https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735)
[08:34:54] (CR) Slyngshede: [C: +2] Add warning for OOM killer. [alerts] - https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:36:38] (Merged) jenkins-bot: Add warning for OOM killer. [alerts] - https://gerrit.wikimedia.org/r/987398 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:36:45] (CR) Muehlenhoff: [C: +2] Update associated email address for dreamyjazz [puppet] - https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: Muehlenhoff)
[08:38:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:39:42] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (MoritzMuehlenhoff) @andrea.denisse : Since this is a staff account, it should only be in the cn=wmf group, but not also in cn=nda (the latter is only for people ha...
[08:42:18] (PS1) Ayounsi: Depool eqsin for switch upgrade [dns] - https://gerrit.wikimedia.org/r/988400 (https://phabricator.wikimedia.org/T332395)
[08:44:18] (CR) Slyngshede: [C: +2] C:puppetmaster::monitoring Reenable Prometheus data collection.
[puppet] - https://gerrit.wikimedia.org/r/988398 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:51:12] PROBLEM - Check systemd state on puppetmaster2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:42] (PS1) Slyngshede: C:puppetmaster::monitoring missing execute bit on script. [puppet] - https://gerrit.wikimedia.org/r/988402 (https://phabricator.wikimedia.org/T350694)
[08:53:07] (CR) Slyngshede: [C: +2] C:puppetmaster::monitoring missing execute bit on script. [puppet] - https://gerrit.wikimedia.org/r/988402 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:57:14] (PS1) D3r1ck01: wmf-config: Remove unused wgCentralAuthTokenCacheType [mediawiki-config] - https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004)
[08:57:40] RECOVERY - Check systemd state on puppetmaster2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:58:41] (PS1) Slyngshede: C:puppetmaster::monitoring misspelled Prometheus. [puppet] - https://gerrit.wikimedia.org/r/988404
[08:59:19] Computers are complicated and I kinda want to go back to bed
[08:59:28] welcome to monday slyngs
[08:59:42] Thanks :-)
[09:00:10] (CR) Slyngshede: [C: +2] C:puppetmaster::monitoring misspelled Prometheus.
[puppet] - https://gerrit.wikimedia.org/r/988404 (owner: Slyngshede)
[09:01:33] (CR) Ayounsi: [C: +2] Depool eqsin for switch upgrade [dns] - https://gerrit.wikimedia.org/r/988400 (https://phabricator.wikimedia.org/T332395) (owner: Ayounsi)
[09:02:58] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppetmerge_labs_private.service,prometheus_puppetmerge_puppet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:15] !log depool eqsin for switch upgrade - T332395
[09:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:21] T332395: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395
[09:04:45] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 35 hosts with reason: eqsin switch upgrade
[09:05:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 35 hosts with reason: eqsin switch upgrade
[09:05:28] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6bec1528-7372-478d-856a-a08325eb04f0) set by ayounsi@cumin1002 for 2:00:00 on 35 host(s) and their services w...
[09:06:09] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:06] (CR) Muehlenhoff: [C: +2] Switch netmon to nftables [puppet] - https://gerrit.wikimedia.org/r/987945 (owner: Muehlenhoff)
[09:07:30] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:38] RECOVERY - Check systemd state on puppetmaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:03] SRE-Access-Requests, Machine-Learning-Team: Requesting - https://phabricator.wikimedia.org/T354516 (isarantopoulos)
[09:23:17] (PS1) Jelto: vrts: auto-restart apache2 on-failure [puppet] - https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478)
[09:24:46] !log start install process on asw1-eqsin - T332395
[09:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:52] T332395: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395
[09:25:27] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[09:27:48] (CR) Filippo Giunchedi: [C: +2] prometheus: validate check Prometheus instance [puppet] - https://gerrit.wikimedia.org/r/987789 (owner: Filippo Giunchedi)
[09:28:27] (CR) Jelto: [C: -1] clamav: add systemd override, enable restart on-failure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988081 (owner: Dzahn)
[09:30:11] (KubernetesAPINotScrapable) resolved: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[09:32:05] !log reboot ms-be1072-82 before adding them to the rings T353149
[09:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:10] T353149: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149
[09:32:37] !log reboot ms-be2074-80 before adding them to the rings T353149
[09:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:54] PROBLEM - Host ms-be1080 is DOWN: PING CRITICAL - Packet loss = 100%
[09:33:56] PROBLEM - Host ms-be1079 is DOWN: PING CRITICAL - Packet loss = 100%
[09:33:58] PROBLEM - Host ms-be1076 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:00] PROBLEM - Host ms-be1082 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:00] PROBLEM - Host ms-be1081 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:06] PROBLEM - Host ms-be1078 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:08] PROBLEM - Host ms-be1077 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:48] PROBLEM - Host ms-be2077 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:48] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:56] PROBLEM - Host ms-be2076 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:56] PROBLEM - Host ms-be2079 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:02] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:02] PROBLEM - Host ms-be2074 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:02] PROBLEM - Host ms-be2080 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:06] eep
[09:35:08] RECOVERY - Host ms-be1077 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:35:10] RECOVERY - Host ms-be1076 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[09:35:10] RECOVERY - Host ms-be1079 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[09:35:10] RECOVERY - Host ms-be1082 is UP: PING OK - Packet loss = 0%, RTA = 3.80 ms
[09:35:16] RECOVERY - Host
ms-be1080 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:35:32] RECOVERY - Host ms-be1081 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:35:38] RECOVERY - Host ms-be1078 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[09:35:54] RECOVERY - Host ms-be2074 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[09:35:56] RECOVERY - Host ms-be2076 is UP: PING OK - Packet loss = 0%, RTA = 32.64 ms
[09:36:00] RECOVERY - Host ms-be2080 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[09:36:20] RECOVERY - Host ms-be2077 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[09:36:30] RECOVERY - Host ms-be2079 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[09:36:32] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[09:36:38] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms
[09:37:18] (PS8) Jelto: clamav: add systemd override, enable restart on-failure [puppet] - https://gerrit.wikimedia.org/r/988081 (owner: Dzahn)
[09:37:18] PROBLEM - Check systemd state on ms-be1077 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:46] PROBLEM - Check systemd state on ms-be2080 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:50] SRE, SRE-Access-Requests, Machine-Learning-Team: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (isarantopoulos)
[09:38:14] (CR) MVernon: [C: +2] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/987718 (https://phabricator.wikimedia.org/T353149) (owner: MVernon)
[09:38:22] RECOVERY - Check systemd state on ms-be1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:42] (PS2) Jelto: vrts: auto-restart apache2 on-failure [puppet] - https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478)
[09:38:50] RECOVERY - Check systemd state on ms-be2080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:18] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:41:18] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1040/co" [puppet] - https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478) (owner: Jelto)
[09:41:51] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/988081 (owner: Dzahn)
[09:45:01] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478) (owner: Jelto)
[09:54:53] !log asw1-eqsin> request system reboot - T332395
[09:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:57] T332395: Upgrade asw1-eqsin -
https://phabricator.wikimedia.org/T332395
[09:56:54] (CR) Majavah: [C: +2] admin: allow using security key backed SSH keys [puppet] - https://gerrit.wikimedia.org/r/981418 (owner: Majavah)
[09:59:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:59:37] jouncebot: nowandnext
[09:59:37] No deployments scheduled for the next 1 hour(s) and 0 minute(s)
[09:59:37] In 1 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1100)
[09:59:46] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:59:57] (PS2) Ladsgroup: Set commonswiki pagelinks migration stage to READ NEW [mediawiki-config] - https://gerrit.wikimedia.org/r/987657 (https://phabricator.wikimedia.org/T351237)
[10:00:01] (CR) Ladsgroup: [C: +2] Set commonswiki pagelinks migration stage to READ NEW [mediawiki-config] - https://gerrit.wikimedia.org/r/987657 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[10:00:07] (CR) Majavah: [C: +2] admin: add security key based keys for taavi [puppet] - https://gerrit.wikimedia.org/r/983430 (owner: Majavah)
[10:00:20] PROBLEM - VRRP status on cr3-eqsin is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:00:21] (CR) Aklapper: "Thanks for the patch! Just for the records I won't be able to look into this until late March earliest.
I hope others find time :-/" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[10:00:26] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 70, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:34] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 59, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:37] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/987657 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[10:02:37] (Merged) jenkins-bot: Set commonswiki pagelinks migration stage to READ NEW [mediawiki-config] - https://gerrit.wikimedia.org/r/987657 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[10:02:52] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:987657|Set commonswiki pagelinks migration stage to READ NEW (T351237)]]
[10:02:56] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[10:04:17] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:987657|Set commonswiki pagelinks migration stage to READ NEW (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:05:32] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:06:16] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:06:24] (CR) Slyngshede: [V: +1 C: +2] P:debmonitor::server Add Prometheus Blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/983108
(https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:06:50] RECOVERY - VRRP status on cr3-eqsin is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:06:56] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:07:04] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:07:15] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:07:18] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:08:18] (CR) Klausman: [V: +2 C: +2] admin_ng: force coredns to resolve to A records in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[10:09:02] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:11:45] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:987657|Set commonswiki pagelinks migration stage to READ NEW (T351237)]] (duration: 08m 52s)
[10:11:49] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[10:12:54] (CR) Ladsgroup: [C: +2] styles: Replace obsolete WikimediaUI
Base var with Codex alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 (owner: 10VolkerE) [10:13:46] (03Merged) 10jenkins-bot: styles: Replace obsolete WikimediaUI Base var with Codex alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 (owner: 10VolkerE) [10:14:15] (03CR) 10Klausman: [C: 03+2] admin_ng: set new Istio Service Entry for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/984214 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [10:14:21] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:987861|styles: Replace obsolete WikimediaUI Base var with Codex alias]] [10:15:12] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:44] !log ladsgroup@deploy2002 volker-e and ladsgroup: Backport for [[gerrit:987861|styles: Replace obsolete WikimediaUI Base var with Codex alias]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:15:52] !log ladsgroup@deploy2002 volker-e and ladsgroup: Continuing with sync [10:17:09] (03Merged) 10jenkins-bot: admin_ng: set new Istio Service Entry for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/984214 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [10:17:12] (03Merged) 10jenkins-bot: admin_ng: force coredns to resolve to A records in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [10:20:09] (03PS1) 10Ayounsi: Repool eqsin for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/988248 (https://phabricator.wikimedia.org/T332395) [10:20:36] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:20:58] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:21:45] (03PS1) 10Ayounsi: Enable mgmt_junos on asw1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/988416 (https://phabricator.wikimedia.org/T332395) [10:21:53] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:987861|styles: Replace obsolete WikimediaUI Base var with Codex alias]] (duration: 07m 32s) [10:22:15] (ProbeDown) firing: (18) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:52] (03PS1) 10Slyngshede: C:puppetmaster::monitoring labs/private uses master branch. [puppet] - 10https://gerrit.wikimedia.org/r/988417 (https://phabricator.wikimedia.org/T350694) [10:25:31] 10SRE, 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10Gehel) [10:26:10] (03CR) 10Slyngshede: [C: 03+2] C:puppetmaster::monitoring labs/private uses master branch. [puppet] - 10https://gerrit.wikimedia.org/r/988417 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:26:21] (03CR) 10Ayounsi: [C: 03+2] Revert "Disable Telemetry on eqsin switches" [homer/public] - 10https://gerrit.wikimedia.org/r/987741 (https://phabricator.wikimedia.org/T332395) (owner: 10Ayounsi) [10:26:57] (03Merged) 10jenkins-bot: Revert "Disable Telemetry on eqsin switches" [homer/public] - 10https://gerrit.wikimedia.org/r/987741 (https://phabricator.wikimedia.org/T332395) (owner: 10Ayounsi) [10:28:52] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:00] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:26] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/986838 (owner: 10PipelineBot) [10:30:44] (03CR) 10Ayounsi: [C: 03+2] Enable mgmt_junos on asw1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/988416 (https://phabricator.wikimedia.org/T332395) (owner: 10Ayounsi) [10:31:20] (03Merged) 10jenkins-bot: Enable mgmt_junos on asw1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/988416 (https://phabricator.wikimedia.org/T332395) (owner: 10Ayounsi) [10:31:30] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/986838 (owner: 10PipelineBot) [10:32:28] (03CR) 10Majavah: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/987781 (owner: 10Muehlenhoff) [10:32:59] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [10:33:39] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:34:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1041/console" [puppet] - 10https://gerrit.wikimedia.org/r/977598 (owner: 10Majavah) [10:35:59] (03CR) 10Ayounsi: [C: 03+2] Repool eqsin for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/988248 (https://phabricator.wikimedia.org/T332395) (owner: 10Ayounsi) [10:36:36] !log repool eqsin - T332395 [10:36:39] (03PS1) 10Peter Fischer: enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 [10:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:40] T332395: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 [10:37:18] (03CR) 10Peter Fischer: "I'll schedule this for the afternoon backport window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (owner: 10Peter Fischer) [10:38:31] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-January-March): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10Pginer-WMF) [10:38:36] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/987488 (owner: 10Dzahn) [10:41:32] (03PS4) 10Ladsgroup: snapshot: Improve border of dumps cards [puppet] - 10https://gerrit.wikimedia.org/r/986181 [10:41:35] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] snapshot: Improve border of dumps cards [puppet] - 10https://gerrit.wikimedia.org/r/986181 (owner: 10Ladsgroup) [10:41:38] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) Here is a summary of our discussions with Google (they proofread this summary): The web is really large and the s... [10:46:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (10ayounsi) 05Open→03Resolved All done. ~10min downtime. 
[10:49:02] (ProbeDown) firing: (6) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:51] (03PS1) 10Kamila Součková: Add ipoid to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/988453 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1100) [11:02:06] (03PS2) 10Kamila Součková: Add ipoid to the service mesh [puppet] - 10https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) [11:04:00] (03PS1) 10Slyngshede: P:debmonitor::server check host for CDN. [puppet] - 10https://gerrit.wikimedia.org/r/988454 (https://phabricator.wikimedia.org/T350694) [11:08:11] (03PS4) 10Clément Goubert: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [11:09:11] (03PS5) 10Clément Goubert: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [11:09:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1042/console" [puppet] - 10https://gerrit.wikimedia.org/r/988454 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:09:55] (03CR) 10VolkerE: "[Thanks Ladsgroup! 
I tried to copy the RTL user name, but Chrome didn't let me.]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 (owner: 10VolkerE) [11:10:35] (03CR) 10Ladsgroup: [C: 03+2] styles: Replace obsolete WikimediaUI Base var with Codex alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 (owner: 10VolkerE) [11:13:29] (03PS1) 10Ladsgroup: Stop writing to the old columns of pagelinks in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988456 (https://phabricator.wikimedia.org/T352010) [11:14:45] (03PS1) 10Marostegui: db2117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988457 (https://phabricator.wikimedia.org/T354506) [11:14:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2117 T354506', diff saved to https://phabricator.wikimedia.org/P54533 and previous config saved to /var/cache/conftool/dbconfig/20240108-111452-root.json [11:14:57] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [11:15:18] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to the old columns of pagelinks in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988456 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [11:16:02] (03Merged) 10jenkins-bot: Stop writing to the old columns of pagelinks in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988456 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [11:16:12] (03CR) 10Marostegui: [C: 03+2] db2117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988457 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [11:17:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2117.codfw.wmnet with OS bookworm [11:17:07] (03PS2) 10Slyngshede: P:debmonitor::server check host for CDN. 
[puppet] - 10https://gerrit.wikimedia.org/r/988454 (https://phabricator.wikimedia.org/T350694) [11:19:25] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988456|Stop writing to the old columns of pagelinks in testwiki (T352010)]] [11:19:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:19:37] (03CR) 10Tacsipacsi: "Thanks for preparing the backport! When do you plan to have it deployed? I don’t see it yet in the deployment calendar (https://wikitech.w" [extensions/Gadgets] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987999 (https://phabricator.wikimedia.org/T354385) (owner: 10Krinkle) [11:20:28] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1043/console" [puppet] - 10https://gerrit.wikimedia.org/r/988454 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:20:54] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988456|Stop writing to the old columns of pagelinks in testwiki (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:20:56] (03CR) 10Clément Goubert: [C: 03+1] Add ipoid to the service mesh [puppet] - 10https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) (owner: 10Kamila Součková) [11:21:19] (03CR) 10Muehlenhoff: P:openldap: convert to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982394 (owner: 10Majavah) [11:22:27] (03PS2) 10Aklapper: phabricator weekly changes email: Explain why some queries are listed [puppet] - 10https://gerrit.wikimedia.org/r/987143 [11:22:45] (03PS1) 10Majavah: OATHAuthServices: Fix service name [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/988252 (https://phabricator.wikimedia.org/T354505) [11:23:18] (03PS1) 10Majavah: Fix disabling two-factor authentication [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/988253 (https://phabricator.wikimedia.org/T354505) [11:23:30] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:23:56] jouncebot: nowandnext [11:23:56] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1100) [11:23:56] In 2 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1400) [11:24:43] (03CR) 10Lucas Werkmeister (WMDE): enable page_rerender for 3rd batch of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (owner: 10Peter Fischer) [11:25:59] (03PS1) 10Ladsgroup: Disable Listings extension everywhere except rowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988460 (https://phabricator.wikimedia.org/T253216) [11:27:05] Amir1: ping me when done please? I'd like to push out fixes for T354505 [11:27:06] T354505: Disabling two-factor authentication is broken - https://phabricator.wikimedia.org/T354505 [11:27:29] sure, it's almost done [11:28:38] (03CR) 10Hnowlan: [C: 03+1] "One nit, otherwise lgtm as far as chart changes are concerned. 
Can't comment on what the image bump might do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [11:29:28] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988456|Stop writing to the old columns of pagelinks in testwiki (T352010)]] (duration: 10m 02s) [11:29:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:30:16] taavi: I'm done, please ping me once you're done, I have another patch to push [11:30:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/988252 (https://phabricator.wikimedia.org/T354505) (owner: 10Majavah) [11:30:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/988253 (https://phabricator.wikimedia.org/T354505) (owner: 10Majavah) [11:30:55] Amir1: thanks, will do [11:31:20] (03PS2) 10Phuedx: Remove partial migration of EditAttemptStep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982467 (https://phabricator.wikimedia.org/T351335) (owner: 10Santiago Faci) [11:31:47] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10BTullis) Could it be this reference in [[https://gitlab.wikimedia.org/repos/sre/wmfdb|wmfdb]] that should be updated to `/etc/ssl/certs/wmf-ca-c... 
[11:35:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2117.codfw.wmnet with reason: host reimage [11:36:12] (03Merged) 10jenkins-bot: OATHAuthServices: Fix service name [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/988252 (https://phabricator.wikimedia.org/T354505) (owner: 10Majavah) [11:36:14] (03Merged) 10jenkins-bot: Fix disabling two-factor authentication [extensions/OATHAuth] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/988253 (https://phabricator.wikimedia.org/T354505) (owner: 10Majavah) [11:36:31] !log taavi@deploy2002 Started scap: Backport for [[gerrit:988252|OATHAuthServices: Fix service name (T354505)]], [[gerrit:988253|Fix disabling two-factor authentication (T354505)]] [11:36:35] T354505: Disabling two-factor authentication is broken - https://phabricator.wikimedia.org/T354505 [11:37:14] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) That's a very good point @BTullis. I'd leave this to @ABran-WMF and @MoritzMuehlenhoff. Orchestrator is still an issue though (which... 
[11:38:02] !log taavi@deploy2002 taavi: Backport for [[gerrit:988252|OATHAuthServices: Fix service name (T354505)]], [[gerrit:988253|Fix disabling two-factor authentication (T354505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:38:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2117.codfw.wmnet with reason: host reimage [11:39:54] !log taavi@deploy2002 taavi: Continuing with sync [11:41:19] (03PS7) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [11:41:47] (03CR) 10KartikMistry: Update cxserver to 2023-12-04-083437-production (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [11:42:08] (03CR) 10Phuedx: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [11:45:52] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:988252|OATHAuthServices: Fix service name (T354505)]], [[gerrit:988253|Fix disabling two-factor authentication (T354505)]] (duration: 09m 21s) [11:45:56] T354505: Disabling two-factor authentication is broken - https://phabricator.wikimedia.org/T354505 [11:46:02] Amir1: i'm done [11:46:15] awesome [11:47:10] (03CR) 10Ladsgroup: [C: 03+2] Disable Listings extension everywhere except rowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988460 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [11:47:18] (03CR) 10EoghanGaffney: [C: 03+1] vrts: auto-restart apache2 on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478) (owner: 10Jelto) [11:47:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988460 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [11:48:01] (03Merged) 10jenkins-bot: Disable Listings extension everywhere except rowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988460 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [11:48:14] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988460|Disable Listings extension everywhere except rowikivoyage (T253216)]] [11:48:18] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [11:50:01] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988460|Disable Listings extension everywhere except rowikivoyage (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:50:57] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:53:25] (03PS1) 10Marostegui: Revert "db2117: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988254 [11:56:58] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988460|Disable Listings extension everywhere except rowikivoyage (T253216)]] (duration: 08m 43s) [11:57:02] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [11:58:13] (03CR) 10VolkerE: "You did well! 
😄" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987861 (owner: 10VolkerE) [11:59:15] (03CR) 10Marostegui: [C: 03+2] Revert "db2117: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988254 (owner: 10Marostegui) [11:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54534 and previous config saved to /var/cache/conftool/dbconfig/20240108-115946-root.json [12:00:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2117.codfw.wmnet with OS bookworm [12:00:41] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 9902 [12:01:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9902 [12:02:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 35847 [12:02:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35847 [12:02:23] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 45287 [12:03:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45287 [12:04:32] (03PS3) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 [12:05:40] (03PS4) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 [12:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1224 T354506', diff saved to https://phabricator.wikimedia.org/P54535 and previous config saved to /var/cache/conftool/dbconfig/20240108-120759-root.json [12:08:04] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [12:08:45] (03PS1) 10Jgiannelos: Revert "wikifeeds: pipeline bot promote" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/988255 [12:08:57] (03PS1) 10Marostegui: db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988468 (https://phabricator.wikimedia.org/T354506) [12:10:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1224.eqiad.wmnet with OS bookworm [12:10:20] (03CR) 10Marostegui: [C: 03+2] db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/988468 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [12:10:25] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:55] (03CR) 10Jgiannelos: [C: 03+2] Revert "wikifeeds: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988255 (owner: 10Jgiannelos) [12:12:11] (03Merged) 10jenkins-bot: Revert "wikifeeds: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988255 (owner: 10Jgiannelos) [12:12:59] (03CR) 10Jelto: [V: 03+1 C: 03+2] vrts: auto-restart apache2 on-failure [puppet] - 10https://gerrit.wikimedia.org/r/988410 (https://phabricator.wikimedia.org/T354478) (owner: 10Jelto) [12:13:08] 10SRE-swift-storage, 10Patch-For-Review: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10CodeReviewBot) mvernon opened https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/7 teach ring manager about more eqiad racks [12:13:30] (03CR) 10Jelto: [C: 03+2] clamav: add systemd override, enable restart on-failure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/988081 (owner: 10Dzahn) [12:14:11] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved 
to https://phabricator.wikimedia.org/P54536 and previous config saved to /var/cache/conftool/dbconfig/20240108-121451-root.json [12:18:34] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:21:24] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:debmonitor::server check host for CDN. [puppet] - 10https://gerrit.wikimedia.org/r/988454 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:21:31] 10SRE, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10Gehel) [12:23:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage [12:24:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987781 (owner: 10Muehlenhoff) [12:24:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/988230 [12:26:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage [12:29:29] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack::designate: use cloud-private for memcached [puppet] - 10https://gerrit.wikimedia.org/r/977598 (owner: 10Majavah) [12:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54537 and previous config saved to /var/cache/conftool/dbconfig/20240108-122956-root.json [12:34:53] (03CR) 10Muehlenhoff: [C: 03+2] toolforge::docker::registry: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/987781 (owner: 10Muehlenhoff) [12:38:37] (MDRAIDFailedDisk) firing: MD RAID - 
Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [12:42:13] (03PS1) 10Marostegui: Revert "db1224: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988256 [12:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54538 and previous config saved to /var/cache/conftool/dbconfig/20240108-124501-root.json [12:46:30] (03PS3) 10Majavah: P:openldap: convert to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/982394 [12:46:32] (03PS4) 10Majavah: O:openldap::rw: don't allow queries from Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/982395 (https://phabricator.wikimedia.org/T317184) [12:46:34] (03CR) 10Marostegui: [C: 03+2] Revert "db1224: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/988256 (owner: 10Marostegui) [12:46:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1224.eqiad.wmnet with OS bookworm [12:46:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54539 and previous config saved to /var/cache/conftool/dbconfig/20240108-124647-root.json [12:46:49] (03CR) 10Majavah: P:openldap: convert to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/982394 (owner: 10Majavah) [12:48:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1044/co" [puppet] - 10https://gerrit.wikimedia.org/r/982394 (owner: 10Majavah) [12:49:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/982394 (owner: 10Majavah) [12:50:21] (03PS1) 
10Slyngshede: P:url_downloader decommission Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/988481 (https://phabricator.wikimedia.org/T350694) [12:52:07] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openldap: convert to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/982394 (owner: 10Majavah) [12:52:15] (ProbeDown) firing: (7) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:55] (03PS1) 10Kosta Harlan: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) [12:57:28] !bash tstarling > I mean, it's a lot of code. No doubt it was more fun to write than it is to read. [12:57:28] Amir1: Stored quip at https://bash.toolforge.org/quip/S28m6YwBhuQtenzvRf7A [12:57:33] (03PS2) 10Peter Fischer: enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) [12:57:47] (03CR) 10CI reject: [V: 04-1] ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) (owner: 10Kosta Harlan) [12:58:07] (03CR) 10Peter Fischer: "Fixed commit message." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [13:00:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54540 and previous config saved to /var/cache/conftool/dbconfig/20240108-130006-root.json [13:00:08] (03PS2) 10Kosta Harlan: ProductionServices: Add entry for ipoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988482 (https://phabricator.wikimedia.org/T325147) [13:00:29] (03PS2) 10Majavah: hieradata: unconfigure wiki replica LVS services [puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) [13:00:31] (03PS1) 10Majavah: hieradata: remove wikireplica service catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/988483 (https://phabricator.wikimedia.org/T346947) [13:01:43] 10SRE, 10SRE-swift-storage, 10Commons, 10User-ArielGlenn: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822 (10jcrespo) Related: T289996 In particular, even if it doesn't affect commons, enwikivoyage has lots of references to old files that were not i... 
[13:01:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54541 and previous config saved to /var/cache/conftool/dbconfig/20240108-130152-root.json [13:02:54] (03PS1) 10Majavah: wmnet: remove aliases for dbproxy1018/9 [dns] - 10https://gerrit.wikimedia.org/r/988484 (https://phabricator.wikimedia.org/T346947) [13:03:44] (03PS5) 10Jgiannelos: wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 (https://phabricator.wikimedia.org/T347027) [13:05:07] (03PS1) 10Majavah: P:wmcs: wikireplicas: remove cloudproxy [puppet] - 10https://gerrit.wikimedia.org/r/988485 [13:09:20] (03PS1) 10Santiago Faci: Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) [13:10:41] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54542 and previous config saved to /var/cache/conftool/dbconfig/20240108-131511-root.json [13:15:19] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54543 and previous config saved to /var/cache/conftool/dbconfig/20240108-131657-root.json [13:22:15] (ProbeDown) firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:58] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) it seems that orchestrator follows the same pattern as the one @Marostegui identified here: >>! In T352974#9389945, @Marostegui wrote... [13:23:54] (03PS3) 10Majavah: P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) [13:23:56] (03PS3) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [13:23:58] (03PS3) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:24:00] (03PS3) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:24:27] (03PS4) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:24:29] (03PS4) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:25:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "One minor nit that is not a blocker. 
LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [13:29:18] (03PS4) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [13:29:20] (03PS5) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:29:22] (03PS5) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:29:24] (03PS1) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) [13:29:36] (03CR) 10FNegri: [C: 03+1] P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) (owner: 10Majavah) [13:30:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54544 and previous config saved to /var/cache/conftool/dbconfig/20240108-133016-root.json [13:31:23] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:31:30] (03PS1) 10Slyngshede: P:debmonitor::server Switch to monitoring CDN endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/988490 [13:32:00] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:32:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54545 and previous config saved to /var/cache/conftool/dbconfig/20240108-133202-root.json [13:32:25] (03CR) 10DCausse: [C: 03+1] enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [13:33:03] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:33:32] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:36:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The key used isn't present in the software, nor the chart (at least according to codesearch). Is there some other commit missing?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [13:36:32] (03PS4) 10Majavah: P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) [13:36:34] (03PS2) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) [13:36:36] (03PS5) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [13:36:38] (03PS6) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:36:40] (03PS6) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:36:48] (03CR) 10CI reject: [V: 04-1] P:debmonitor::server Switch to monitoring CDN endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/988490 (owner: 10Slyngshede) [13:38:15] (03PS2) 10Slyngshede: P:debmonitor::server rework debmonitor http monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) [13:38:27] (03CR) 10FNegri: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [13:38:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [13:39:18] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:42:58] (03CR) 10CI reject: [V: 04-1] P:debmonitor::server rework debmonitor http monitoring. 
[puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:43:03] (03PS5) 10Majavah: P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) [13:43:05] (03PS3) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) [13:43:07] (03PS6) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [13:43:09] (03PS7) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:43:11] (03PS7) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:43:19] (03PS1) 10KartikMistry: testwiki: Enable Section translation on WPs with potential to be supported with MinT using MADLAD-400 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988493 (https://phabricator.wikimedia.org/T353510) [13:44:24] (03PS3) 10Slyngshede: P:debmonitor::server rework debmonitor http monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) [13:45:57] (03PS1) 10Majavah: Add fake toolforge-rsa DKIM keys [labs/private] - 10https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112) [13:46:00] (03CR) 10FNegri: [C: 03+1] "AFAIK these are indeed not in use. Were they used before the LVS setup was introduced? 
In any case I think they can go, unless @btullis di" [puppet] - 10https://gerrit.wikimedia.org/r/988485 (owner: 10Majavah) [13:46:08] 10SRE-swift-storage, 10Patch-For-Review: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10CodeReviewBot) mvernon merged https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/7 teach ring manager about more eqiad racks [13:47:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54546 and previous config saved to /var/cache/conftool/dbconfig/20240108-134707-root.json [13:47:29] (03PS1) 10Jelto: miscweb: set requests and limit for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988495 (https://phabricator.wikimedia.org/T300171) [13:48:23] (03CR) 10FNegri: "I would also remove the comments in hieradata/hosts/dbproxy101[89].yaml that mention "wikireplica-web" and "wikireplica-analytics"." 
[dns] - 10https://gerrit.wikimedia.org/r/988484 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [13:48:47] (03CR) 10Majavah: [C: 03+2] P:toolforge::mailrelay: log mail sent from non-Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971894 (https://phabricator.wikimedia.org/T341004) (owner: 10Majavah) [13:49:52] (03PS4) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) [13:49:54] (03PS7) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [13:49:56] (03PS8) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [13:49:58] (03PS8) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [13:51:15] (03CR) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [13:54:25] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) for the orchestrator part, it seems that mariadb client [[ https://github.com/wikimedia/operations-puppet/blob/6d6dc6f4cae913de17bfb4... 
[13:55:27] (03CR) 10JMeybohm: [C: 03+1] miscweb: set requests and limit for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988495 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [13:59:07] (03CR) 10Slyngshede: "Translating the Icinga monitoring to Prometheus and BlackBox exporter is a little confusing, made worse by the different host headers, but" [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:59:31] (03CR) 10Jelto: [C: 03+2] miscweb: set requests and limit for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988495 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1400). nyaa~ [14:00:05] pfischer and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:28] O/ [14:00:49] i can deploy today! [14:00:55] ok! 
[14:01:01] unless Lucas_WMDE wants to of course :)) [14:01:09] nah, I don’t have to ^^ [14:01:15] (03Merged) 10jenkins-bot: miscweb: set requests and limit for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988495 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [14:01:21] hehe [14:01:30] (03PS3) 10Urbanecm: enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [14:01:33] (03CR) 10Urbanecm: [C: 03+2] enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [14:01:57] !log installing curl security updates [14:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54547 and previous config saved to /var/cache/conftool/dbconfig/20240108-140212-root.json [14:02:25] (03Merged) 10jenkins-bot: enable page_rerender for 3rd batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988449 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [14:02:45] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]] [14:02:49] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [14:03:48] (03CR) 10Majavah: [C: 03+2] P:wmcs: wikireplicas: remove cloudproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988485 (owner: 10Majavah) [14:04:11] !log urbanecm@deploy2002 pfischer and urbanecm: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:14] 
(03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/984164 (owner: 10Muehlenhoff) [14:04:25] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:04:40] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:04:42] pfischer: can you check at mwdebug2001, please? :) [14:04:50] Sure, one sec. [14:05:29] (03CR) 10FNegri: [C: 03+1] Add fake toolforge-rsa DKIM keys [labs/private] - 10https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [14:05:42] (03CR) 10Majavah: [V: 03+2 C: 03+2] Add fake toolforge-rsa DKIM keys [labs/private] - 10https://gerrit.wikimedia.org/r/988494 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [14:06:02] (03CR) 10Hnowlan: [C: 04-1] Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [14:06:06] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks Taavi." 
[dns] - 10https://gerrit.wikimedia.org/r/988484 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:06:17] urbanecm: +2 [14:06:28] proceeding [14:06:29] (03PS1) 10David Caro: lighthttpd: don't remove environment vars [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) [14:06:29] !log urbanecm@deploy2002 pfischer and urbanecm: Continuing with sync [14:06:38] (03PS2) 10Urbanecm: Add agent.app_install_id to android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987159 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:06:40] (03CR) 10Urbanecm: [C: 03+2] Add agent.app_install_id to android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987159 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:06:46] (03PS3) 10Urbanecm: Remove partial migration of EditAttemptStep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982467 (https://phabricator.wikimedia.org/T351335) (owner: 10Santiago Faci) [14:06:49] (03CR) 10Urbanecm: [C: 03+2] Remove partial migration of EditAttemptStep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982467 (https://phabricator.wikimedia.org/T351335) (owner: 10Santiago Faci) [14:06:57] (03PS2) 10Urbanecm: Add new stream names to the config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982903 (https://phabricator.wikimedia.org/T353297) (owner: 10Kimberly Sarabia) [14:07:01] (03CR) 10Urbanecm: [C: 03+2] Add new stream names to the config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982903 (https://phabricator.wikimedia.org/T353297) (owner: 10Kimberly Sarabia) [14:07:51] urbanecm: I can test all three of those on the deployment server [14:08:02] phuedx: you mean at mwdebug, right? 
[14:08:02] Just let me know when [14:08:08] (03Merged) 10jenkins-bot: Add agent.app_install_id to android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987159 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:08:09] *debug server [14:08:11] (03Merged) 10jenkins-bot: Remove partial migration of EditAttemptStep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982467 (https://phabricator.wikimedia.org/T351335) (owner: 10Santiago Faci) [14:08:14] yup, will let you know once available there. [14:08:15] (03Merged) 10jenkins-bot: Add new stream names to the config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982903 (https://phabricator.wikimedia.org/T353297) (owner: 10Kimberly Sarabia) [14:08:24] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Use core page HTML in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/987158 (https://phabricator.wikimedia.org/T347027) (owner: 10Jgiannelos) [14:11:07] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:20] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]] (duration: 09m 35s) [14:12:24] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [14:12:56] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]] [14:13:03] T353680: Android Metrics Platform Migration Data Validation - first pass - first 4 tables - https://phabricator.wikimedia.org/T353680 [14:13:03] T351335: Remove partial migration of EditAttemptStep instrument - 
https://phabricator.wikimedia.org/T351335 [14:13:04] T353297: Empty tables for mediawiki_web_ui_scroll_migrated and mediawiki_web_ui_actions - https://phabricator.wikimedia.org/T353297 [14:14:16] (03PS1) 10Jelto: miscweb: also set requests for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988499 (https://phabricator.wikimedia.org/T300171) [14:14:43] !log urbanecm@deploy2002 urbanecm and phuedx and ksarabia and sfaci: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:48] (03PS1) 10Peter Fischer: Search update pipeline: 3rd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/988500 (https://phabricator.wikimedia.org/T351503) [14:15:03] phuedx: your patch is at mwdebug, can you test? [14:15:16] (03CR) 10Jelto: "follow up for I0c473e4347f611d44154ebc47ab6dccda2cbbb9f :) I forgot about the requests." [deployment-charts] - 10https://gerrit.wikimedia.org/r/988499 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [14:15:25] urbanecm: All of the patches? [14:15:30] correct [14:15:35] Thanks. Will do. [14:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54548 and previous config saved to /var/cache/conftool/dbconfig/20240108-141717-root.json [14:18:03] https://gerrit.wikimedia.org/r/982467 LGTM. I don't see the mediawiki.edit_attempt stream any more and nothing is being sent to the browser on a regular pageview [14:18:05] Testing the others [14:19:11] (03CR) 10Andrew Bogott: [C: 03+1] "!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: 10Dzahn) [14:21:22] urbanecm: https://gerrit.wikimedia.org/r/982903 LGTM [14:22:09] phuedx: awesome. 987159 looks good as well? or should i wait? [14:22:17] But https://gerrit.wikimedia.org/r/987159 will have to be reverted as it has a typo in it. If it's complicated to do now, I can submit the revert later [14:22:24] It won't break anything if it's deployed [14:22:26] okay. [14:22:29] i'll revert it now [14:22:34] +1 [14:23:06] phuedx: or if submitting a follow-up and fixing the typo is easy, feel free to upload that patch as well [14:23:36] One sec [14:23:40] ok, waiting [14:26:12] (03PS1) 10Phuedx: agent.app_ -> agent_app_ in android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988504 (https://phabricator.wikimedia.org/T353680) [14:26:47] There is a task to lint those values during CI. It should be prioritized :) [14:27:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [14:27:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [14:29:36] 10SRE, 10Thumbor, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint 4 (8th Jan.‘24 - 19th Jan.'24)): Error creating thumbnail: Unknown option --no-external-files - https://phabricator.wikimedia.org/T354407 (10hnowlan) Judging by T104147, this looks like a hack that can be r... 
[14:29:55] (03CR) 10Giuseppe Lavagetto: service.yaml: add iPoid to the service catalogue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [14:30:36] (03PS3) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) [14:30:38] (03PS1) 10Andrew Bogott: profile::toolforge::nfs_disable_tool: install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/988505 [14:33:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add ipoid to the service mesh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988453 (https://phabricator.wikimedia.org/T325147) (owner: 10Kamila Součková) [14:34:07] urbanecm: https://gerrit.wikimedia.org/r/988504 [14:34:23] (03CR) 10Urbanecm: [C: 03+2] agent.app_ -> agent_app_ in android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988504 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:34:31] let me pull it to mwdebug as well. [14:34:32] one sec [14:34:37] !log urbanecm@deploy2002 Sync cancelled. [14:34:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988504 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:35:01] (03CR) 10CI reject: [V: 04-1] profile::toolforge::nfs_disable_tool: install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/988505 (owner: 10Andrew Bogott) [14:35:31] (03Merged) 10jenkins-bot: agent.app_ -> agent_app_ in android.product_metrics.* streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988504 (https://phabricator.wikimedia.org/T353680) (owner: 10Phuedx) [14:35:40] (03CR) 10Ssingh: [C: 03+1] "Let us know if you want to roll it out." 
[puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:35:48] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]] [14:36:05] T353680: Android Metrics Platform Migration Data Validation - first pass - first 4 tables - https://phabricator.wikimedia.org/T353680 [14:36:05] T351335: Remove partial migration of EditAttemptStep instrument - https://phabricator.wikimedia.org/T351335 [14:36:05] T353297: Empty tables for mediawiki_web_ui_scroll_migrated and mediawiki_web_ui_actions - https://phabricator.wikimedia.org/T353297 [14:36:56] (03PS5) 10Majavah: P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) [14:36:58] (03PS8) 10Majavah: P:toolforge::mailrelay: rewrite maintainers in Python [puppet] - 10https://gerrit.wikimedia.org/r/971891 (https://phabricator.wikimedia.org/T341006) [14:37:00] (03PS9) 10Majavah: P:toolforge::mailrelay: only relay for Toolforge, not Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/971892 [14:37:02] (03PS9) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [14:37:13] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:13] !log urbanecm@deploy2002 urbanecm and phuedx and ksarabia and sfaci: Backport for [[gerrit:987159|Add agent.app_install_id to 
android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:25] phuedx: mind checking it works now? :) [14:37:50] (03CR) 10FNegri: [C: 03+1] P:toolforge::mailrelay: double-sign mail with RSA DKIM keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [14:38:40] urbanecm: On it [14:39:17] (03CR) 10FNegri: "@dcaro do you still think this can be useful? I was a bit scared of merging something right before the holidays, but we can test it now." [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [14:40:00] urbanecm: The config for those event streams LGTM [14:40:07] !log urbanecm@deploy2002 urbanecm and phuedx and ksarabia and sfaci: Continuing with sync [14:40:19] awesome, proceeding [14:41:03] Thanks for waiting urbanecm [14:41:08] no worries [14:41:38] (03PS1) 10Kamila Součková: TEMPORARY role for debugging T354413 for mw1377 [puppet] - 10https://gerrit.wikimedia.org/r/988507 (https://phabricator.wikimedia.org/T354413) [14:42:10] (03CR) 10Majavah: [C: 03+2] P:toolforge::mailrelay: double-sign mail with RSA DKIM keys [puppet] - 10https://gerrit.wikimedia.org/r/988489 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [14:43:09] (03CR) 10CI reject: [V: 04-1] TEMPORARY role for debugging T354413 for mw1377 [puppet] - 10https://gerrit.wikimedia.org/r/988507 (https://phabricator.wikimedia.org/T354413) (owner: 10Kamila Součková) [14:46:00] (03PS5) 10EoghanGaffney: [gerrit] Refactor classes to specify a primary host [puppet] - 10https://gerrit.wikimedia.org/r/988464 [14:46:10] !log urbanecm@deploy2002 Finished scap: Backport for 
[[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]] (duration: 10m 22s) [14:46:17] (03PS2) 10Kamila Součková: TEMPORARY role for debugging T354413 for mw1377 [puppet] - 10https://gerrit.wikimedia.org/r/988507 (https://phabricator.wikimedia.org/T354413) [14:46:19] T353680: Android Metrics Platform Migration Data Validation - first pass - first 4 tables - https://phabricator.wikimedia.org/T353680 [14:46:19] T351335: Remove partial migration of EditAttemptStep instrument - https://phabricator.wikimedia.org/T351335 [14:46:19] T353297: Empty tables for mediawiki_web_ui_scroll_migrated and mediawiki_web_ui_actions - https://phabricator.wikimedia.org/T353297 [14:46:20] phuedx: and live [14:46:23] anything else? [14:46:50] (03CR) 10EoghanGaffney: "The PCC build shows nothing except to the parameters of the affected classes, with no changes on the host." 
[puppet] - 10https://gerrit.wikimedia.org/r/988464 (owner: 10EoghanGaffney) [14:48:57] PROBLEM - Check systemd state on db1211 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:59] (03CR) 10Majavah: [C: 03+2] wmnet: remove aliases for dbproxy1018/9 [dns] - 10https://gerrit.wikimedia.org/r/988484 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:49:58] (03PS2) 10Andrew Bogott: profile::toolforge::nfs_disable_tool: install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/988505 (https://phabricator.wikimedia.org/T353642) [14:50:00] (03PS4) 10Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - 10https://gerrit.wikimedia.org/r/987187 (https://phabricator.wikimedia.org/T353642) [14:51:13] PROBLEM - Check systemd state on kafka-logging2004 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:55] (03CR) 10FNegri: [C: 03+2] wmcs_wheel_of_misfortune: exclude uid<1000 [puppet] - 10https://gerrit.wikimedia.org/r/988024 (https://phabricator.wikimedia.org/T354430) (owner: 10FNegri) [14:52:07] RECOVERY - Check systemd state on db1211 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:49] RECOVERY - Check systemd state on kafka-logging2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:04] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10cmooney) >>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>! In T300152#9437438, @ayounsi wrote: >> On naming I didn't use `private1-ganeti-codfw` as I didn't want to... 
[14:54:55] (03CR) 10Kamila Součková: [C: 03+2] TEMPORARY role for debugging T354413 for mw1377 [puppet] - 10https://gerrit.wikimedia.org/r/988507 (https://phabricator.wikimedia.org/T354413) (owner: 10Kamila Součková) [14:57:13] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:13] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1006 - https://phabricator.wikimedia.org/T354143 (10Jclark-ctr) 05Open→03Resolved Replaced Failed Drive [15:04:42] (03CR) 10Krinkle: [C: 03+2] Fix parsing logic when comments or hidden characters are present [extensions/Gadgets] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987999 (https://phabricator.wikimedia.org/T354385) (owner: 10Krinkle) [15:05:17] urbanecm: done? [15:05:51] Krinkle: yup! [15:08:27] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Jclark-ctr) Reopened Ticket with Dell [15:09:12] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (10Jclark-ctr) @RKemper unsure if this is ready for dcops? 
[15:11:28] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T353913 (10Jclark-ctr) a:03VRiley-WMF [15:11:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:01] (03CR) 10Clément Goubert: service.yaml: add iPoid to the service catalogue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [15:12:12] (03PS1) 10Kamila Součková: TEMPORARY for debugging T354413: enable overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/988510 [15:12:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:51] (03PS2) 10Kamila Součková: TEMPORARY for debugging T354413: enable overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/988510 [15:13:29] (03CR) 10Hnowlan: [C: 03+1] "I wish there were a nicer way to avoid all the repetition from Server in AsyncServer but I don't think there is without getting into ugly " [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/980918 (https://phabricator.wikimedia.org/T338297) (owner: 10Giuseppe Lavagetto) [15:14:01] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10Jclark-ctr) a:03Jclark-ctr [15:14:14] (03CR) 10Kamila Součková: [C: 03+2] TEMPORARY for debugging T354413: enable overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/988510 (owner: 10Kamila Součková) [15:14:42] (03CR) 10Ottomata: update eventstream helm values.yaml file to include hard-coded list of redacted pages (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [15:15:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:40] (03CR) 10Hnowlan: 
[C: 04-1] Deploying edit-analytics to production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [15:17:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:19:15] RECOVERY - Check systemd state on kubernetes2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:27] RECOVERY - Check systemd state on mw1465 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:42] (03PS2) 10Btullis: Switch s7-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987425 (https://phabricator.wikimedia.org/T351921) [15:22:05] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10MoritzMuehlenhoff) >>! In T300152#9441778, @cmooney wrote: >>>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>>! In T300152#9437438, @ayounsi wrote: >>> On naming I d... 
[15:23:13] (03PS2) 10Santiago Faci: Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) [15:23:29] happy new year all :D [15:23:58] (03Merged) 10jenkins-bot: Fix parsing logic when comments or hidden characters are present [extensions/Gadgets] (wmf/1.42.0-wmf.12) - 10https://gerrit.wikimedia.org/r/987999 (https://phabricator.wikimedia.org/T354385) (owner: 10Krinkle) [15:24:17] that must have been one loooong ny eve party! [15:24:22] (03CR) 10Santiago Faci: "Removed the 'version' line from the values-staging.yaml file" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [15:24:58] !log krinkle@deploy2002 Started scap: Backport for [[gerrit:987999|Fix parsing logic when comments or hidden characters are present (T354385)]] [15:25:08] T354385: Some Gadgets are broken after 1.42.0-wmf.12 update with incorrect error message - https://phabricator.wikimedia.org/T354385 [15:26:24] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:987999|Fix parsing logic when comments or hidden characters are present (T354385)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:26:46] !log krinkle@deploy2002 krinkle: Continuing with sync [15:26:58] tested via https://test.wikipedia.org/wiki/MediaWiki:Gadgets-definition [15:28:25] (03CR) 10Hnowlan: [C: 03+1] Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [15:30:02] (03CR) 10Santiago Faci: [V: 03+2 C: 03+2] Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [15:30:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1465 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:31:09] (03Merged) 10jenkins-bot: Deploying edit-analytics to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/988486 (https://phabricator.wikimedia.org/T354074) (owner: 10Santiago Faci) [15:32:27] (03PS4) 10Hnowlan: changeprop-jobqueue: move PublishStashedFile back to non-k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/983216 (https://phabricator.wikimedia.org/T349796) [15:32:50] !log krinkle@deploy2002 Finished scap: Backport for [[gerrit:987999|Fix parsing logic when comments or hidden characters are present (T354385)]] (duration: 07m 52s) [15:32:54] T354385: Some Gadgets are broken after 1.42.0-wmf.12 update with incorrect error message - https://phabricator.wikimedia.org/T354385 [15:33:50] (03PS1) 10Ladsgroup: Undeploy Listings extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988654 (https://phabricator.wikimedia.org/T253216) [15:34:01] (03CR) 10JMeybohm: [C: 03+1] miscweb: also set requests for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988499 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [15:35:27] !log Draining and cordoning kubestage2002.codfw.wmnet - T352883 [15:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:31] T352883: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 [15:35:54] 10SRE, 10Infrastructure-Foundations: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (10SLyngshede-WMF) a:03SLyngshede-WMF [15:36:17] 10SRE, 10Infrastructure-Foundations: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (10SLyngshede-WMF) Possible solution: clean the docker cache when pulling a new image. 
[15:37:19] jouncebot: nowandnext [15:37:19] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [15:37:19] In 0 hour(s) and 52 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1630) [15:37:37] (03CR) 10Ladsgroup: [C: 03+2] Undeploy Listings extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988654 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:37:41] (03CR) 10Jelto: [C: 03+2] miscweb: also set requests for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988499 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [15:37:50] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Clement_Goubert) [15:38:06] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) 05Open→03In progress a:05Clement_Goubert→03Papaul Host is now drained and cordoned. It is in codfw rack... 
[15:38:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988654 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:38:30] (03Merged) 10jenkins-bot: Undeploy Listings extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988654 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:38:39] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) Self dispatched 8 new drives for cloudcephosd1028 [15:38:53] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988654|Undeploy Listings extension, part I (T253216)]] [15:38:57] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [15:39:11] (03Merged) 10jenkins-bot: miscweb: also set requests for bugzilla staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/988499 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [15:39:37] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10jhathaway) a:03jhathaway [15:39:40] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) Scheduled next thursday to do the swap of the drives, will get the host out of the cluster before that. 
[15:39:48] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10jhathaway) p:05Triage→03Low [15:40:11] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:40:18] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988654|Undeploy Listings extension, part I (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:40:56] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (10jhathaway) p:05Triage→03Low [15:41:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:41:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:41:13] 10SRE-tools, 10Infrastructure-Foundations: Read Ganeti cluster config for cookbooks from Netbox - https://phabricator.wikimedia.org/T340015 (10Volans) p:05Triage→03Medium [15:41:15] PROBLEM - Host mw1377 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:17] RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [15:43:05] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:20] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:43:22] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [15:43:34] 10SRE, 10Infrastructure-Foundations, 10LDAP: LDAP connections use TLSv1.0 and TLSv1.1 - https://phabricator.wikimedia.org/T329218 (10MoritzMuehlenhoff) p:05Triage→03Low a:03MoritzMuehlenhoff [15:45:01] PROBLEM - Host mw1377 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:09] (03PS1) 10Kamila Součková: Revert "TEMPORARY for debugging T354413: enable overlayfs" [puppet] - 10https://gerrit.wikimedia.org/r/988263 [15:45:32] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:45:53] RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [15:45:59] (03PS5) 10Eevans: restbase: set production role and add config for restbase2035 [puppet] - 10https://gerrit.wikimedia.org/r/981609 (https://phabricator.wikimedia.org/T352468) [15:46:00] 10SRE, 10Infrastructure-Foundations: Tweak Kerberos auth logging - https://phabricator.wikimedia.org/T331123 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [15:46:17] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:46:22] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:46:56] mw1377 is inactive as per SAL. should we downtime it as well? 
[15:47:13] (03CR) 10Kamila Součková: [C: 03+2] Revert "TEMPORARY for debugging T354413: enable overlayfs" [puppet] - 10https://gerrit.wikimedia.org/r/988263 (owner: 10Kamila Součková) [15:47:15] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988654|Undeploy Listings extension, part I (T253216)]] (duration: 08m 22s) [15:47:28] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [15:47:30] sukhe: yeah, I'll do that, kamila_ is using it to test some reboot issues [15:47:50] (03PS1) 10Ladsgroup: Undeploy listing extension part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988655 (https://phabricator.wikimedia.org/T253216) [15:47:54] ok, thank you, as long as it's known! [15:48:03] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:20] (KubernetesRsyslogDown) resolved: (2) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:48:37] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw1377.eqiad.wmnet with reason: reboot debugging [15:48:43] (03CR) 10Ladsgroup: [C: 03+2] Undeploy listing extension part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988655 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:48:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988655 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:48:54] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw1377.eqiad.wmnet with reason: reboot debugging [15:49:19] sukhe: btw, it's going to be a bit confusing for a while, we're moving mw 
appservers to be kubernetes nodes, but we're not renaming them yet [15:49:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [15:49:26] (03PS1) 10Kamila Součková: TEMPORARY for debugging T354413: kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/988656 (https://phabricator.wikimedia.org/T354413) [15:49:28] (03Merged) 10jenkins-bot: Undeploy listing extension part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988655 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [15:49:28] this is one of them [15:49:40] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2035 [puppet] - 10https://gerrit.wikimedia.org/r/981609 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [15:49:40] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988655|Undeploy listing extension part II (T253216)]] [15:49:47] claime: all good! 
just keeping a check on hosts down as part of on-call [15:50:46] (03CR) 10Kamila Součková: [C: 03+2] TEMPORARY for debugging T354413: kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/988656 (https://phabricator.wikimedia.org/T354413) (owner: 10Kamila Součková) [15:51:14] (03PS1) 10Ladsgroup: Undeploy Listings extension part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988658 (https://phabricator.wikimedia.org/T253216) [15:51:18] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988655|Undeploy listing extension part II (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:52:09] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:52:13] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:53:18] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:53:33] 10SRE, 10Bitu, 10Infrastructure-Foundations: Define the core attribute list managed in the IDM with all stakeholders - https://phabricator.wikimedia.org/T320805 (10SLyngshede-WMF) 05Open→03Invalid This failed to live up to our expectations, and we'll instead add attributes on a need basis. 
[15:53:36] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [15:53:43] (03PS2) 10Peter Fischer: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 [15:56:06] (03CR) 10Btullis: [C: 03+2] Switch s7-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987425 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [15:57:32] !log sfaci@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [15:57:48] !log sfaci@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [15:58:21] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988655|Undeploy listing extension part II (T253216)]] (duration: 08m 40s) [15:58:24] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [15:59:08] !log sfaci@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [15:59:29] !log sfaci@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [15:59:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988658 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [16:00:05] (03CR) 10Majavah: [C: 03+1] profile::toolforge::nfs_disable_tool: install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/988505 (https://phabricator.wikimedia.org/T353642) (owner: 10Andrew Bogott) [16:00:47] (03Merged) 10jenkins-bot: Undeploy Listings extension part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988658 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [16:01:00] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] [16:01:43] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping 
experiments - https://phabricator.wikimedia.org/T332024 (10joanna_borun) [16:01:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:26] (03CR) 10Majavah: [C: 03+2] hieradata: unconfigure wiki replica LVS services [puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [16:02:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:38] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) It is still the case with `debmonitor-client` `0.3.2-1+deb11u1` when `debmonitor.discovery.wmnet` is unreachable (no ping, even connection r... [16:04:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.766 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:18] (03PS1) 10Btullis: Revert "Switch s7-analytics-replica to dbstore1008" [dns] - 10https://gerrit.wikimedia.org/r/988264 [16:04:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:27] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) wmfdb [[ https://gitlab.wikimedia.org/repos/sre/wmfdb/-/jobs/186537 | has been released ]], I'll move on to orchestrator test/debugging [16:05:34] (03CR) 10Btullis: [C: 03+2] Revert "Switch s7-analytics-replica to dbstore1008" [dns] - 10https://gerrit.wikimedia.org/r/988264 (owner: 10Btullis) [16:06:20] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin 
breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) \o/ thanks! [16:07:34] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) @ABran-WMF would you deploy that new version to cumin1001? [16:09:13] !log restart pybal on lvs1020 - T346947 [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:31] T346947: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 [16:10:07] 10SRE-tools, 10Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (10Volans) p:05Triage→03Low [16:10:17] (03Abandoned) 10Hnowlan: changeprop-jobqueue: move PublishStashedFile back to non-k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/983216 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:11:34] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) oh shoot I have to build it to bullseye as well! 
let me check [16:11:38] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10cmooney) p:05Triage→03Low [16:13:03] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:13:20] ^ me, expected [16:13:37] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:14:08] (03CR) 10Andrew Bogott: [C: 03+2] profile::toolforge::nfs_disable_tool: install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/988505 (https://phabricator.wikimedia.org/T353642) (owner: 10Andrew Bogott) [16:14:28] (03PS1) 10Kamila Součková: TEMPORARY role for debugging T354413: puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/988664 (https://phabricator.wikimedia.org/T354413) [16:14:40] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:14:49] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:14:52] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [16:15:07] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs10[09-10].eqiad.wmnet - https://phabricator.wikimedia.org/T353482 (10RKemper) @Jclark-ctr Yes, these hosts are fully ready to be decom'd. 
[16:15:27] !log restart pybal on lvs1018 - T346947 [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:48] T346947: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 [16:15:59] (03CR) 10Kamila Součková: [C: 03+2] TEMPORARY role for debugging T354413: puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/988664 (https://phabricator.wikimedia.org/T354413) (owner: 10Kamila Součková) [16:18:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2033.codfw.wmnet [16:18:49] (03CR) 10Jforrester: "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988658 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [16:20:52] 10SRE, 10Infrastructure-Foundations, 10Traffic: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10joanna_borun) a:03ayounsi [16:20:53] !log lvs1020: sudo ipvsadm --delete-service --tcp-service 208.80.154.242:3311 (and all the way to :3318) - T346947 [16:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:58] T346947: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 [16:21:20] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:21:28] !log lvs1020: sudo ipvsadm --delete-service --tcp-service 208.80.154.243:3311 (and all the way to :3318) - T346947 [16:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:27] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:23:17] !log lvs1018: sudo ipvsadm --delete-service --tcp-service 208.80.154.242:3311 (and all the way to :3318) - T346947 [16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:45] 
10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Papaul) @Clement_Goubert thanks will work on it in a minute [16:24:18] !log lvs1018: sudo ipvsadm --delete-service --tcp-service 208.80.154.243:3311 (and all the way to :3318) - T346947 [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:43] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:25:06] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] (duration: 24m 06s) [16:25:14] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [16:25:14] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp [16:25:26] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:25:36] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10cmooney) @papaul let me know what port is used on lsw1-b8-codfw once done and I will make the Netbox changes and assign new IPs f... 
[16:25:44] (03PS1) 10FNegri: dologmsg: standarize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [16:26:03] (03PS2) 10Majavah: hieradata: remove wikireplica service catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/988483 (https://phabricator.wikimedia.org/T346947) [16:27:13] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:27:38] (03CR) 10Ssingh: [C: 03+1] hieradata: remove wikireplica service catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/988483 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [16:27:50] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10Jelto) p:05Triage→03Medium Thanks for opening the task! We will pick that topic up in our next team meeting. 
[16:28:26] (03CR) 10Majavah: [C: 03+2] hieradata: remove wikireplica service catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/988483 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [16:29:03] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:34] (03PS1) 10Majavah: O:mariadb::proxy: remove LVS realserver profile [puppet] - 10https://gerrit.wikimedia.org/r/988670 (https://phabricator.wikimedia.org/T346947) [16:30:03] (03PS2) 10Majavah: O:mariadb::proxy: remove LVS realserver profile [puppet] - 10https://gerrit.wikimedia.org/r/988670 (https://phabricator.wikimedia.org/T346947) [16:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1630). [16:30:46] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [16:30:47] (03PS3) 10Majavah: O:mariadb::proxy::replicas: remove LVS realserver profile [puppet] - 10https://gerrit.wikimedia.org/r/988670 (https://phabricator.wikimedia.org/T346947) [16:31:01] (03CR) 10Ssingh: [C: 03+1] "Same as durum and already deployed there, so looks good." [puppet] - 10https://gerrit.wikimedia.org/r/984836 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [16:31:37] (03CR) 10Btullis: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/988670 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [16:32:04] (03CR) 10Majavah: [C: 03+2] O:mariadb::proxy::replicas: remove LVS realserver profile [puppet] - 10https://gerrit.wikimedia.org/r/988670 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [16:32:13] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:32:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [16:33:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [16:33:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2033.codfw.wmnet [16:34:07] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove unwanted AAAA records from new dbstore hosts - btullis@cumin1002" [16:35:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove unwanted AAAA records from new dbstore hosts - btullis@cumin1002" [16:35:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:35:01] PROBLEM - cassandra-b service on restbase2035 is CRITICAL: CRITICAL - Expecting active but unit 
cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:35:35] PROBLEM - Disk space on mw2259 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops [16:35:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [16:36:21] !log btullis@cumin1002 START - Cookbook sre.dns.wipe-cache dbstore1008.eqiad.wmnet on all recursors [16:36:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dbstore1008.eqiad.wmnet on all recursors [16:36:50] (03PS1) 10Btullis: Revert "Revert "Switch s7-analytics-replica to dbstore1008"" [dns] - 10https://gerrit.wikimedia.org/r/988265 [16:37:04] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988673 (https://phabricator.wikimedia.org/T128546) [16:37:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2034.codfw.wmnet [16:37:21] PROBLEM - cassandra-c CQL 10.192.48.239:9042 on restbase2035 is CRITICAL: connect to address 10.192.48.239 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:39:47] PROBLEM - cassandra-c SSL 10.192.48.239:7000 on restbase2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:40:58] (03CR) 10Vgutierrez: [C: 03+1] "thanks for taking care of this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/984836 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [16:41:06] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] [16:41:09] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [16:42:17] PROBLEM - cassandra-c service on restbase2035 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:42:18] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988673 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:42:47] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:42:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:43:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988673 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:43:32] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:44:02] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and not P{cp[5030,5032].eqsin.wmnet} and A:cp [16:46:05] oh hi ladsgroup! 
let me know when your sync is done (I'll do a portal update afterwards) [16:46:54] oh sorry [16:46:57] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [16:46:59] https://www.irccloud.com/pastebin/Fju2tGn0/ [16:48:00] PROBLEM - Disk space on mw2259 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=84%) [16:48:16] (03CR) 10Btullis: [C: 03+2] Revert "Revert "Switch s7-analytics-replica to dbstore1008"" [dns] - 10https://gerrit.wikimedia.org/r/988265 (owner: 10Btullis) [16:48:46] that host has lots of old trains on disk [16:48:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [16:49:25] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [16:49:53] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] (duration: 08m 47s) [16:49:57] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [16:51:44] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] [16:51:50] taavi: I cleaned it, it should be fine now [16:52:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [16:52:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - pt1979@cumin2002" [16:52:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:32] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts ganeti2034.codfw.wmnet [16:53:16] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:53:57] PROBLEM - cassandra-a CQL 10.192.48.237:9042 on restbase2035 is CRITICAL: connect to address 10.192.48.237 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:54:30] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1377.eqiad.wmnet with OS bullseye [16:55:30] (03PS1) 10Peter Fischer: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988675 (https://phabricator.wikimedia.org/T353460) [16:55:37] RECOVERY - Disk space on mw2259 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2259&var-datasource=codfw+prometheus/ops [16:56:23] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: 3rd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/988500 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [16:56:32] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988675 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [16:56:52] (03PS1) 10Kamila Součková: mw1377: change role to insetup [puppet] - 10https://gerrit.wikimedia.org/r/988676 (https://phabricator.wikimedia.org/T354413) [16:56:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:57:15] (03Merged) 10jenkins-bot: Search update pipeline: 3rd batch page_rerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/988500 (https://phabricator.wikimedia.org/T351503) (owner: 10Peter Fischer) [16:57:23] 10SRE, 10Infrastructure-Foundations, 10Privacy Engineering, 10Security-Team, and 3 others: 
netbox.wikimedia.org/metrics and netbox-next.wikimedia.org/metrics publicly expose prometheus and python metrics - https://phabricator.wikimedia.org/T318838 (10sbassett) Since the specific issue within this task appe... [16:57:25] (03Merged) 10jenkins-bot: Search update pipeline: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988675 (https://phabricator.wikimedia.org/T353460) (owner: 10Peter Fischer) [16:58:08] 10SRE, 10Infrastructure-Foundations, 10Privacy Engineering, 10Security-Team, and 3 others: netbox.wikimedia.org/metrics and netbox-next.wikimedia.org/metrics publicly expose prometheus and python metrics - https://phabricator.wikimedia.org/T318838 (10sbassett) p:05Triage→03Low [16:58:32] (03CR) 10Kamila Součková: [C: 03+2] mw1377: change role to insetup [puppet] - 10https://gerrit.wikimedia.org/r/988676 (https://phabricator.wikimedia.org/T354413) (owner: 10Kamila Součková) [16:59:02] (03CR) 10Bking: [C: 03+2] team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 (owner: 10Ebernhardson) [16:59:25] (03CR) 10Bking: [V: 03+1 C: 03+2] team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 (owner: 10Ebernhardson) [17:00:00] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host mw1377.eqiad.wmnet with OS bullseye [17:00:17] (03Merged) 10jenkins-bot: team-search-platform: Update job queue alerts to use histogram [alerts] - 10https://gerrit.wikimedia.org/r/987206 (owner: 10Ebernhardson) [17:00:51] PROBLEM - cassandra-b CQL 10.192.48.238:9042 on restbase2035 is CRITICAL: connect to address 10.192.48.238 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:01:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [17:02:04] 
(03PS2) 10FNegri: dologmsg: standarize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [17:02:19] (03PS3) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [17:02:45] PROBLEM - Disk space on mw2267 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops [17:03:21] PROBLEM - cassandra-b SSL 10.192.48.238:7000 on restbase2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:04:08] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] (duration: 12m 24s) [17:04:16] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [17:04:39] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] [17:05:24] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:06:08] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:06:15] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:06:39] (03PS1) 10Slyngshede: Clearify that the user needs to provide shell account. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/988677 (https://phabricator.wikimedia.org/T338825) [17:06:40] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:07:59] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:08:05] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:11:07] (03PS46) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [17:11:09] (03PS1) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [17:12:28] (03PS2) 10Btullis: Switch s5-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987426 (https://phabricator.wikimedia.org/T351921) [17:12:31] (03CR) 10CI reject: [V: 04-1] vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 (owner: 10AOkoth) [17:12:40] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:988658|Undeploy Listings extension part III (T253216)]] (duration: 08m 01s) [17:12:54] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [17:13:35] finally the scap is finished without any issues [17:14:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [17:14:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:14:51] (03CR) 10Btullis: [C: 03+2] Switch s5-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987426 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [17:15:15] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:15:58] !log pfischer@deploy2002 helmfile [staging] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [17:17:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: host reimage [17:17:48] (03PS2) 10Btullis: Switch s1-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987427 (https://phabricator.wikimedia.org/T351921) [17:18:58] !log wipe prometheus@k8s eqiad WAL and restart - T354399 [17:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:20] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [17:21:46] (03CR) 10Btullis: [C: 03+2] Switch s1-analytics-replica to dbstore1008 [dns] - 10https://gerrit.wikimedia.org/r/987427 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [17:21:48] (03PS4) 10Effie Mouzeli: (WIP) mcrouter vanilla chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981461 [17:22:52] RECOVERY - Disk space on mw2267 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2267&var-datasource=codfw+prometheus/ops [17:27:45] (03PS2) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [17:31:26] (03CR) 10FNegri: "The modified script is available as `/home/fnegri/dologmsg` in tools-sgebastion-10 if you want to test it." 
[puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [17:34:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1377.eqiad.wmnet with OS bullseye [17:35:00] (03PS4) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [17:36:51] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:988673| Bumping portals to master (T128546)]] (duration: 06m 21s) [17:36:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:39:18] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:43:08] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:988673| Bumping portals to master (T128546)]] (duration: 06m 17s) [17:43:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:45:12] (03PS3) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) [17:47:35] (03CR) 10Htriedman: update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [17:49:41] (03PS5) 10FNegri: dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) [17:52:20] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [17:53:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [17:54:56] (03CR) 10Marostegui: [C: 03+1] Disable monitoring on dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/987420 (https://phabricator.wikimedia.org/T351921) (owner: 10Btullis) [17:56:22] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10fnegri) p:05Triage→03High [17:56:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [17:56:44] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10fnegri) 05Open→03In progress [17:57:00] (03CR) 10CI reject: [V: 04-1] dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [17:57:22] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:02] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Papaul) @cmooney xe-0/0/26 [17:58:54] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Papaul) [17:59:04] (03CR) 10FNegri: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1800) [18:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T1800). [18:00:36] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [18:01:03] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Papaul) mainboard replaced by @Jhancock.wm . She is running the provision cookbook now. [18:01:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Disk (sdh) failed in ms-be2068 - https://phabricator.wikimedia.org/T354180 (10Papaul) Waiting to receive the replacement disk before closing the task. [18:02:35] (CalicoTyphaDown) firing: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [18:03:06] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:26] (KubernetesCalicoDown) firing: (3) kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:03:26] (CalicoKubeControllersDown) firing: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:03:47] (03PS1) 10Majavah: Move dbproxy1018/9 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/988681 (https://phabricator.wikimedia.org/T346947) [18:07:35] (CalicoTyphaDown) resolved: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [18:08:26] (KubernetesCalicoDown) firing: (3) kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:08:26] (CalicoKubeControllersDown) resolved: Calico Kubernetes Controllers not running - 
https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:08:42] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 31.16 ms [18:10:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:11:13] (03PS1) 10Bking: regex.yaml: Fix regex for elastic2083 [puppet] - 10https://gerrit.wikimedia.org/r/988682 (https://phabricator.wikimedia.org/T354543) [18:12:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:12:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988682 (https://phabricator.wikimedia.org/T354543) (owner: 10Bking) [18:18:44] (03CR) 10DCausse: [C: 03+1] regex.yaml: Fix regex for elastic2083 [puppet] - 10https://gerrit.wikimedia.org/r/988682 (https://phabricator.wikimedia.org/T354543) (owner: 10Bking) [18:19:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:21:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:23:42] (03CR) 10Bking: [C: 03+2] regex.yaml: Fix regex for elastic2083 [puppet] - 10https://gerrit.wikimedia.org/r/988682 (https://phabricator.wikimedia.org/T354543) (owner: 10Bking) [18:25:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:26:29] (03PS1) 10Arlolra: Switch testreduce to 1002 [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) [18:26:53] (03CR) 10Ssingh: druid: remove druid100[4-6] from druid_public_broker VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974120 (owner: 10Stevemunene) [18:27:04] (03CR) 10Dzahn: "would this script be 
executed just because it's in that directory or does it need to actually be run?" [puppet] - 10https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) (owner: 10AOkoth) [18:27:29] (03CR) 10Dzahn: [C: 03+2] "!:)" [labs/private] - 10https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: 10Dzahn) [18:27:36] removes pmtpa stuff :p [18:27:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:27:55] (03CR) 10Dzahn: [V: 03+2 C: 03+2] secret: delete fake keys for hosts in Tampa(!) [labs/private] - 10https://gerrit.wikimedia.org/r/988084 (https://phabricator.wikimedia.org/T84536) (owner: 10Dzahn) [18:29:56] (03CR) 10Subramanya Sastry: [C: 04-1] "Needs review by Daniel & Moritz since Moritz had done this previously and we ran into some problems (I forget the details) and reverted it" [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [18:34:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:35:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [18:37:10] (03CR) 10Dzahn: "I don't know why it was reverted, I was on sabbatical back then and meanwhile the parsoid test hosts are maintained by serviceops team." [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [18:39:38] (03CR) 10Dzahn: [C: 03+2] alerting: replace serviceops-collab with new team name [puppet] - 10https://gerrit.wikimedia.org/r/987488 (owner: 10Dzahn) [18:40:13] (03CR) 10Subramanya Sastry: "After this change, the endpoint failed to resolve ... but, we could reapply / retry just in case it was something transient back then." 
[dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [18:44:59] (03CR) 10Ssingh: "Looks good overall and we can deploy this tomorrow. Sorry for the two different reviews, I should have addressed this in the previous one." [puppet] - 10https://gerrit.wikimedia.org/r/974120 (owner: 10Stevemunene) [18:49:02] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic2083:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:49:11] (03CR) 10Subramanya Sastry: [C: 03+1] "Moritz/Daniel: Could we retry this and see if it works this time?" [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [18:51:00] (03CR) 10Bking: [C: 03+2] aptrepo: add Elastic-related components to bookworm repo [puppet] - 10https://gerrit.wikimedia.org/r/987859 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [18:52:35] (03CR) 10Ssingh: Switch testreduce to 1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [18:59:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:01:43] (03CR) 10Arlolra: Switch testreduce to 1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [19:03:40] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Switch testreduce to 1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [19:04:19] !log running authdns-update for CR 988684: T336043 [19:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:23] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 [19:04:29] 
!log running authdns-update for CR 988684: T345220 [19:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:33] T345220: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 [19:08:18] (03CR) 10Dzahn: [V: 03+2 C: 03+2] secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) (owner: 10Dzahn) [19:08:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:08:50] (03PS4) 10Dzahn: secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) [19:09:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:11:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:13:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:15:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy GRACEFUL [19:18:38] (03CR) 10AOkoth: vrts: enable connection pooling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988679 (https://phabricator.wikimedia.org/T354484) (owner: 10AOkoth) [19:19:13] (03CR) 10Dzahn: [V: 03+2] secret: remove passwords and fake key for ganglia [labs/private] - 10https://gerrit.wikimedia.org/r/988085 (https://phabricator.wikimedia.org/T253555) (owner: 10Dzahn) [19:21:15] (03CR) 10Arlolra: "Noting that this was deployed and is back to 502'ing" [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [19:23:50] PROBLEM - Host mw2394 is DOWN: PING CRITICAL - 
Packet loss = 100% [19:27:38] !log make puppet re-generate empty envoy config file on testreduce1002 T345220 [19:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:41] T345220: Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 [19:28:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [19:30:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [19:34:24] (03CR) 10Majavah: Switch testreduce to 1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [19:36:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (10KFrancis) Hi all, please provide Dima Koushha's WMDE email address to kfrancis@wikimedia.org and I'll prepare the NDA. Thanks! 
[19:41:56] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:16] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:54:10] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:58:51] (03PS1) 10Clare Ming: Remove android.metrics_platform.* stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988714 (https://phabricator.wikimedia.org/T354199) [20:01:24] (03PS1) 10Andrew Bogott: Trove: switch mount path for storage volume to /sdb [puppet] - 10https://gerrit.wikimedia.org/r/988715 [20:06:27] (03CR) 10Andrew Bogott: [C: 03+2] Trove: switch mount path for storage volume to /sdb [puppet] - 10https://gerrit.wikimedia.org/r/988715 (owner: 10Andrew Bogott) [20:17:08] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/988106/1046/" [puppet] - 10https://gerrit.wikimedia.org/r/988106 (owner: 10Dzahn) [20:17:59] (03CR) 10Sharvaniharan: [C: 03+1] "Looks good to me! 
🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988714 (https://phabricator.wikimedia.org/T354199) (owner: 10Clare Ming) [20:22:55] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "complete noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/988106 (owner: 10Dzahn) [20:24:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:24:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:34:10] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:36:08] (03PS2) 10Dzahn: phabricator: move prometheus smtp check to monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/988116 [20:41:13] (03PS2) 10DDesouza: research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988131 (https://phabricator.wikimedia.org/T352583) [20:47:45] 10SRE, 10Observability-Alerting: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (10lmata) >>! In T307958#9390852, @fgiunchedi wrote: > This is essentially what https://alerts.wikimedia.org/triage/ displays now, for `hide_alerts_older_than: '1200h'` alerts. The app also offer... 
[20:48:32] (03PS1) 10Eevans: restbase: partitioning and insetup for restbase10[34-42] [puppet] - 10https://gerrit.wikimedia.org/r/988728 (https://phabricator.wikimedia.org/T354227) [20:51:36] (03CR) 10Dzahn: [C: 03+2] phabricator: move prometheus smtp check to monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/988116 (owner: 10Dzahn) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T2100) [21:00:04] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:34] i will deploy! [21:04:02] RECOVERY - cassandra-a CQL 10.192.48.237:9042 on restbase2035 is OK: TCP OK - 0.030 second response time on 10.192.48.237 port 9042 https://phabricator.wikimedia.org/T93886 [21:04:48] RECOVERY - cassandra-b service on restbase2035 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:05:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988714 (https://phabricator.wikimedia.org/T354199) (owner: 10Clare Ming) [21:06:18] RECOVERY - cassandra-b SSL 10.192.48.238:7000 on restbase2035 is OK: SSL OK - Certificate restbase2035-b valid until 2025-12-07 21:03:45 +0000 (expires in 698 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:06:50] (03Merged) 10jenkins-bot: Remove android.metrics_platform.* stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988714 (https://phabricator.wikimedia.org/T354199) (owner: 10Clare Ming) [21:07:05] !log cjming@deploy2002 Started scap: Backport for [[gerrit:988714|Remove android.metrics_platform.* stream definitions (T354199)]] [21:07:10] T354199: 
Java MPC shouldn't broadcast events to multiple streams - https://phabricator.wikimedia.org/T354199 [21:08:33] !log cjming@deploy2002 cjming: Backport for [[gerrit:988714|Remove android.metrics_platform.* stream definitions (T354199)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:58] !log cjming@deploy2002 cjming: Continuing with sync [21:14:38] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [21:15:22] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:988714|Remove android.metrics_platform.* stream definitions (T354199)]] (duration: 08m 17s) [21:15:26] T354199: Java MPC shouldn't broadcast events to multiple streams - https://phabricator.wikimedia.org/T354199 [21:15:53] (03CR) 10Arlolra: Switch testreduce to 1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/988684 (https://phabricator.wikimedia.org/T345220) (owner: 10Arlolra) [21:18:43] my patch was the only one in the queue -- i'll hang out for a few more minutes in case anyone else needs something before closing the window [21:21:02] (03CR) 10Dzahn: [C: 03+2] research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/988131 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [21:21:54] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: Exclude listing some WMCS team tags [puppet] - 10https://gerrit.wikimedia.org/r/987141 (owner: 10Aklapper) [21:22:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:24:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [21:25:03] (03CR) 
10Dzahn: [C: 03+2] phabricator weekly changes email: Explain why some queries are listed [puppet] - 10https://gerrit.wikimedia.org/r/987143 (owner: 10Aklapper) [21:27:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [21:29:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [21:30:31] (03CR) 10Dzahn: [C: 03+2] "I haven't manually triggered a mail this time. Knowing you can test it." [puppet] - 10https://gerrit.wikimedia.org/r/987143 (owner: 10Aklapper) [21:32:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [21:37:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:37:47] !log end of UTC late backport window [21:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:33] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:50:09] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240108T2200). 
[22:04:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy GRACEFUL [22:08:04] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [22:09:10] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:12:04] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [22:20:55] (03PS1) 10Bking: elastic: assign prod role to elastic2088 [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) [22:21:33] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [22:21:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [22:22:13] (03CR) 10CI reject: [V: 04-1] elastic: assign prod role to elastic2088 [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [22:22:59] (03PS2) 10Bking: elastic: assign prod role to elastic2088 [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) [22:23:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [22:29:10] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:29:33] 10SRE-swift-storage, 
10Infrastructure-Foundations, 10Patch-For-Review: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10jhathaway) 05Open→03Resolved @MatthewVernon & @Clement_Goubert rsync server has been converted to concat and the... [22:30:45] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic2087\.codfw\.wmnet [22:32:23] (03PS3) 10Bking: elastic: assign prod role to elastic2088 [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) [22:37:36] (03PS4) 10Ryan Kemper: elastic: assign prod role to elastic2088 [puppet] - 10https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: 10Bking) [22:56:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with reboot policy GRACEFUL [22:57:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [22:58:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2394.mgmt.codfw.wmnet with reboot policy FORCED [23:20:25] (03PS3) 10EoghanGaffney: [vrts] Adjust restart and oom policy for clamav and vrts services [puppet] - 10https://gerrit.wikimedia.org/r/988739 (https://phabricator.wikimedia.org/T354478) [23:50:16] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [23:55:16] (KubernetesRsyslogDown) firing: (3) rsyslog on mw1380:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown