[00:08:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P53582 and previous config saved to /var/cache/conftool/dbconfig/20231120-000811-arnaudb.json
[00:10:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:13:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:23:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P53583 and previous config saved to /var/cache/conftool/dbconfig/20231120-002317-arnaudb.json
[00:38:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T348183)', diff saved to https://phabricator.wikimedia.org/P53584 and previous config saved to /var/cache/conftool/dbconfig/20231120-003824-arnaudb.json
[00:38:27] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[00:38:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[00:38:40] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[00:38:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T348183)', diff saved to https://phabricator.wikimedia.org/P53585 and previous config saved to /var/cache/conftool/dbconfig/20231120-003846-arnaudb.json
[00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) -
https://gerrit.wikimedia.org/r/974645
[00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974645 (owner: TrainBranchBot)
[00:43:10] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:22] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974645 (owner: TrainBranchBot)
[01:15:02] (PS4) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363)
[02:19:07] (PS3) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[02:38:22] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:08:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:21] (ProbeDown) firing: (4)
Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:30:45] (CR) Pppery: Merge in changes to qqq.json rather than overwriting them (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) (owner: Pppery)
[03:39:16] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:40:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:13:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:33:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T348183)', diff saved to https://phabricator.wikimedia.org/P53586 and previous config saved to /var/cache/conftool/dbconfig/20231120-053347-arnaudb.json
[05:33:52] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[05:48:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P53587 and previous config saved to /var/cache/conftool/dbconfig/20231120-054853-arnaudb.json
[06:04:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P53588 and previous
config saved to /var/cache/conftool/dbconfig/20231120-060400-arnaudb.json
[06:12:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[06:12:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[06:19:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T348183)', diff saved to https://phabricator.wikimedia.org/P53589 and previous config saved to /var/cache/conftool/dbconfig/20231120-061906-arnaudb.json
[06:19:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[06:19:11] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[06:19:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[06:19:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53590 and previous config saved to /var/cache/conftool/dbconfig/20231120-061928-arnaudb.json
[06:25:18] !log installing qemu security updates on bullseye
[06:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:23] (PS1) Muehlenhoff: mediawiki::packages: Clean up absented packages [puppet] - https://gerrit.wikimedia.org/r/975451
[06:40:38] ACKNOWLEDGEMENT - snapshot of s1 in eqiad on backupmon1001 is CRITICAL: snapshot for s1 at eqiad (db1140) taken more than 3 days ago: Most recent backup 2023-11-16 02:32:57 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:38] ACKNOWLEDGEMENT - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2097) taken more
than 3 days ago: Most recent backup 2023-11-16 03:47:45 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:38] ACKNOWLEDGEMENT - snapshot of s2 in eqiad on backupmon1001 is CRITICAL: snapshot for s2 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 04:05:48 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:38] ACKNOWLEDGEMENT - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2139) taken more than 3 days ago: Most recent backup 2023-11-16 06:43:49 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:38] ACKNOWLEDGEMENT - snapshot of s3 in eqiad on backupmon1001 is CRITICAL: snapshot for s3 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2023-11-16 06:31:49 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:38] ACKNOWLEDGEMENT - snapshot of s4 in codfw on backupmon1001 is CRITICAL: snapshot for s4 at codfw (db2099) taken more than 3 days ago: Most recent backup 2023-11-16 02:45:56 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:39] ACKNOWLEDGEMENT - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2023-11-16 02:39:56 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:39] ACKNOWLEDGEMENT - snapshot of s5 in codfw on backupmon1001 is CRITICAL: snapshot for s5 at codfw (db2101) taken more than 3 days ago: Most recent backup 2023-11-16 05:41:36 Marostegui https://phabricator.wikimedia.org/T351617
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:40] ACKNOWLEDGEMENT - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: snapshot for s5 at eqiad (db1216) taken more than 3 days ago: Most recent backup 2023-11-16 05:15:57 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:40] ACKNOWLEDGEMENT - snapshot of s6 in codfw on backupmon1001 is CRITICAL: snapshot for s6 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-11-16 01:14:47 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:41] ACKNOWLEDGEMENT - snapshot of s6 in eqiad on backupmon1001 is CRITICAL: snapshot for s6 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 01:10:44 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:41] ACKNOWLEDGEMENT - snapshot of s7 in codfw on backupmon1001 is CRITICAL: snapshot for s7 at codfw (db2098) taken more than 3 days ago: Most recent backup 2023-11-16 07:18:22 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:42] ACKNOWLEDGEMENT - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: snapshot for s7 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2023-11-16 07:15:41 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:42] ACKNOWLEDGEMENT - snapshot of s8 in codfw on backupmon1001 is CRITICAL: snapshot for s8 at codfw (db2098) taken more than 3 days ago: Most recent backup 2023-11-16 03:07:57 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:43] ACKNOWLEDGEMENT - snapshot of s8
in eqiad on backupmon1001 is CRITICAL: snapshot for s8 at eqiad (db1171) taken more than 3 days ago: Most recent backup 2023-11-16 03:11:09 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:43] ACKNOWLEDGEMENT - snapshot of x1 in codfw on backupmon1001 is CRITICAL: snapshot for x1 at codfw (db2097) taken more than 3 days ago: Most recent backup 2023-11-16 05:34:33 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:40:44] ACKNOWLEDGEMENT - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: snapshot for x1 at eqiad (db1225) taken more than 3 days ago: Most recent backup 2023-11-16 14:37:06 Marostegui https://phabricator.wikimedia.org/T351617 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:42:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host apt1002.wikimedia.org
[06:43:53] (PS1) Muehlenhoff: Switch apt1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975452 (https://phabricator.wikimedia.org/T349619)
[06:45:51] (CR) Muehlenhoff: [C: +2] Switch apt1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975452 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1210 T351283', diff saved to https://phabricator.wikimedia.org/P53591 and previous config saved to /var/cache/conftool/dbconfig/20231120-064733-root.json
[06:47:38] T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283
[06:49:00] (PS1) Marostegui: db1210: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/975453 (https://phabricator.wikimedia.org/T351283)
[06:49:32] (CR) Marostegui: [C: +2] db1210: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/975453
(https://phabricator.wikimedia.org/T351283) (owner: Marostegui)
[06:51:55] (PS3) KartikMistry: testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915)
[06:52:27] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/975454 (https://phabricator.wikimedia.org/T351284)
[06:52:33] jouncebot: next
[06:52:33] In 1 hour(s) and 7 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T0800)
[06:52:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host apt1002.wikimedia.org
[06:52:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[06:53:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[06:54:00] !log installing python3.7 security updates
[06:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:39] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/975454 (https://phabricator.wikimedia.org/T351284) (owner: Marostegui)
[06:55:28] (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/975454 (https://phabricator.wikimedia.org/T351284) (owner: Marostegui)
[06:55:36] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[06:56:01] (PS1) Marostegui: pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/975455 (https://phabricator.wikimedia.org/T351284)
[06:56:31] (CR) Marostegui: [C: +2] pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/975455 (https://phabricator.wikimedia.org/T351284) (owner: Marostegui)
[06:56:38] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:975454|ProductionServices.php: Promote pc1014 to pc3 master (T351284)]]
[06:56:42] T351284: Upgrade pc3 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351284
[06:56:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:52] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:00] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:975454|ProductionServices.php: Promote pc1014 to pc3 master (T351284)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:58:42] !log marostegui@deploy2002 marostegui: Continuing with sync
[06:59:45] marostegui: OK to go with cxserver update?
[07:02:05] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:03:56] kart_: go for it!
[07:04:09] kart_: still finishing my deploy (just a minute I guess)
[07:04:36] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:975454|ProductionServices.php: Promote pc1014 to pc3 master (T351284)]] (duration: 07m 58s)
[07:04:36] kart_: done!
[07:04:43] T351284: Upgrade pc3 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351284
[07:05:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1013.eqiad.wmnet with OS bookworm
[07:07:43] (PS1) Marostegui: Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/975385
[07:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:08:21] (CR) Marostegui: [C: +2] Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/975385 (owner: Marostegui)
[07:10:45] marostegui: thanks!
[07:11:04] (CR) KartikMistry: [C: +2] Update cxserver to 2023-11-07-081511-production [deployment-charts] - https://gerrit.wikimedia.org/r/972323 (https://phabricator.wikimedia.org/T349118) (owner: KartikMistry)
[07:13:18] (Merged) jenkins-bot: Update cxserver to 2023-11-07-081511-production [deployment-charts] - https://gerrit.wikimedia.org/r/972323 (https://phabricator.wikimedia.org/T349118) (owner: KartikMistry)
[07:14:59] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:15:26] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:16:50] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/975586
[07:16:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1013.eqiad.wmnet with reason: host reimage
[07:17:10] kart_: let me know when you are done (no rush)
[07:17:40] (CR) Muehlenhoff: [C: +2] Also configure acmechief hosts for initially
migrated roles [puppet] - https://gerrit.wikimedia.org/r/975254 (owner: Muehlenhoff)
[07:19:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1013.eqiad.wmnet with reason: host reimage
[07:20:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53592 and previous config saved to /var/cache/conftool/dbconfig/20231120-072000-arnaudb.json
[07:20:08] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[07:21:25] marostegui: sure.
[07:21:33] thanks!
[07:21:46] (PS1) Marostegui: Revert "pc1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/975587
[07:22:11] (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera file [puppet] - https://gerrit.wikimedia.org/r/975252 (owner: Muehlenhoff)
[07:22:31] (CR) Muehlenhoff: [C: +2] Remove Hiera setting on an-worker1111 [puppet] - https://gerrit.wikimedia.org/r/975251 (owner: Muehlenhoff)
[07:28:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:34:27] !log installing ncurses security updates
[07:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53593 and previous config saved to /var/cache/conftool/dbconfig/20231120-073506-arnaudb.json
[07:35:40] (CR) Ayounsi: Generate subnet DHCP configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[07:37:14] !log marostegui@cumin1001 END (PASS) -
Cookbook sre.hosts.reimage (exit_code=0) for host pc1013.eqiad.wmnet with OS bookworm
[07:37:31] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:38:08] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:39:16] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:41:32] (CR) Marostegui: [C: +2] Revert "pc1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/975587 (owner: Marostegui)
[07:41:34] marostegui: Sorry for more wait. I may need to revert above change also.
[07:41:45] kart_: no worries
[07:44:46] (PS1) KartikMistry: Revert "Update cxserver to 2023-11-07-081511-production" [deployment-charts] - https://gerrit.wikimedia.org/r/975588
[07:46:11] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[07:46:26] (CR) KartikMistry: [C: +2] Revert "Update cxserver to 2023-11-07-081511-production" [deployment-charts] - https://gerrit.wikimedia.org/r/975588 (owner: KartikMistry)
[07:47:27] (Merged) jenkins-bot: Revert "Update cxserver to 2023-11-07-081511-production" [deployment-charts] - https://gerrit.wikimedia.org/r/975588 (owner: KartikMistry)
[07:48:32] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:49:03] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:50:13] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) a:ayounsi Emailed the 2 networks again. I'll delete the sessions if they don't reply or fix them.
[07:50:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53594 and previous config saved to /var/cache/conftool/dbconfig/20231120-075013-arnaudb.json
[07:51:11] marostegui: I'm done. Keeping staging not reverted to further investigate.
[07:51:16] kart_: Thank you!
[07:51:25] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/975586 (owner: Marostegui)
[07:52:06] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/975586 (owner: Marostegui)
[07:52:46] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:975586|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]]
[07:54:03] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:975586|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:54:53] !log marostegui@deploy2002 marostegui: Continuing with sync
[07:55:19] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53595 and previous config saved to /var/cache/conftool/dbconfig/20231120-075615-root.json
[07:56:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T0800).
[08:00:04] No Gerrit patches in the queue for this window AFAICS.
[08:00:38] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:975586|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 07m 52s)
[08:04:04] Oh, I've patch to deploy. I added in wrong date :/
[08:04:26] marostegui: Deploying again for config patch..
[08:05:00] (PS1) Arnaudb: bashrc: add a function to quick show info [puppet] - https://gerrit.wikimedia.org/r/975746 (https://phabricator.wikimedia.org/T344036)
[08:05:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53596 and previous config saved to /var/cache/conftool/dbconfig/20231120-080519-arnaudb.json
[08:05:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[08:05:26] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915) (owner: KartikMistry)
[08:05:33] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:05:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[08:05:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53597 and previous config saved to /var/cache/conftool/dbconfig/20231120-080541-arnaudb.json
[08:06:01] (CR) Aqu: Send metrics
from Airflow analytics test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[08:06:17] (PS1) Elukey: team-ml: add site to the ORES alert's dashboard link [alerts] - https://gerrit.wikimedia.org/r/975736 (https://phabricator.wikimedia.org/T346151)
[08:06:49] (Merged) jenkins-bot: testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915) (owner: KartikMistry)
[08:07:05] !log kartik@deploy2002 Started scap: Backport for [[gerrit:973170|testwiki: Enable the Unified Content Translation Dashboard (T337915)]]
[08:07:15] T337915: Enable the Unified Content Translation Dashboard on a test wiki - https://phabricator.wikimedia.org/T337915
[08:08:20] !log kartik@deploy2002 kartik: Backport for [[gerrit:973170|testwiki: Enable the Unified Content Translation Dashboard (T337915)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:09:10] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:09:21] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:09:48] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[08:10:03] (CR) Aqu: Send metrics from Airflow analytics test (2 comments) [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[08:10:12] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:11:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53598 and previous config saved to /var/cache/conftool/dbconfig/20231120-081120-root.json
[08:12:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:13:13] !log kartik@deploy2002 kartik: Continuing with sync
[08:13:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:14:39] (CR) Jelto: [C: +2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329) (owner: Dduvall)
[08:18:54] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:973170|testwiki: Enable the Unified Content Translation Dashboard (T337915)]] (duration: 11m 49s)
[08:19:04] T337915: Enable the Unified Content Translation Dashboard on a test wiki - https://phabricator.wikimedia.org/T337915
[08:22:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53599 and previous config saved to /var/cache/conftool/dbconfig/20231120-082224-arnaudb.json
[08:22:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53600 and previous config saved to /var/cache/conftool/dbconfig/20231120-082248-arnaudb.json
[08:25:15] (PS2) WMDE-Fisch: Update the list of ReferenceTooltip gadget names [mediawiki-config] - https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314)
[08:25:35] (CR) WMDE-Fisch: Update the list of ReferenceTooltip gadget names (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: WMDE-Fisch)
[08:26:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53601 and previous config saved to /var/cache/conftool/dbconfig/20231120-082625-root.json
[08:26:29]
(CR) Marostegui: [C: +1] bashrc: add a function to quick show info [puppet] - https://gerrit.wikimedia.org/r/975746 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[08:36:36] (CR) JMeybohm: [V: +1 C: +2] k8s: Make kubelet register new nodes as unschedulable [puppet] - https://gerrit.wikimedia.org/r/975258 (owner: JMeybohm)
[08:37:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53602 and previous config saved to /var/cache/conftool/dbconfig/20231120-083729-arnaudb.json
[08:37:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53603 and previous config saved to /var/cache/conftool/dbconfig/20231120-083753-arnaudb.json
[08:38:10] (CR) JMeybohm: [C: +2] Normalize config/sites.yaml to be machine editable [homer/public] - https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074) (owner: JMeybohm)
[08:39:27] (Merged) jenkins-bot: Normalize config/sites.yaml to be machine editable [homer/public] - https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074) (owner: JMeybohm)
[08:41:32] (PS4) WMDE-Fisch: Update the list of NavigationPopups gadget names [mediawiki-config] - https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314)
[08:42:04] (CR) WMDE-Fisch: Update the list of NavigationPopups gadget names (11 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: WMDE-Fisch)
[08:44:47] (CR) Giuseppe Lavagetto: [C: +1] Normalize conftool-data/node/{eqiad,codfw}.yaml to be machine editable [puppet] - https://gerrit.wikimedia.org/r/975227 (https://phabricator.wikimedia.org/T351074) (owner: JMeybohm)
[08:49:52] (PS3) Volans: sre.hardware.upgrade-firmware: prepare for
fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 [08:50:19] (03CR) 10Volans: [C: 03+2] sre.hardware.upgrade-firmware: prepare for fixes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans) [08:52:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53605 and previous config saved to /var/cache/conftool/dbconfig/20231120-085233-arnaudb.json [08:52:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53606 and previous config saved to /var/cache/conftool/dbconfig/20231120-085258-arnaudb.json [08:54:25] !log Refresh client certificate for central logging on pfw's - T351110 [08:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:31] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans) [08:56:17] (03CR) 10JMeybohm: [C: 03+1] "Some minor comments, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto) [08:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53607 and previous config saved to /var/cache/conftool/dbconfig/20231120-085636-root.json [08:57:03] (03PS3) 10Volans: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 [08:57:15] (03CR) 10Volans: [C: 03+2] sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans) [08:57:45] (03CR) 10Volans: [C: 03+2] "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans) [08:59:34] (03PS1) 10Elukey: changeprop: use safe_load_all for make_beta_config.py [deployment-charts] - 
10https://gerrit.wikimedia.org/r/975738 [09:00:19] !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye [09:01:23] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans) [09:02:52] (03CR) 10JMeybohm: [C: 03+2] Normalize conftool-data/node/{eqiad,codfw}.yaml to be machine editable [puppet] - 10https://gerrit.wikimedia.org/r/975227 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [09:07:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53608 and previous config saved to /var/cache/conftool/dbconfig/20231120-090738-arnaudb.json [09:08:03] PROBLEM - Check systemd state on kubernetes1051 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53609 and previous config saved to /var/cache/conftool/dbconfig/20231120-090803-arnaudb.json [09:09:03] (03PS3) 10Volans: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) [09:09:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop: use safe_load_all for make_beta_config.py [deployment-charts] - 10https://gerrit.wikimedia.org/r/975738 (owner: 10Elukey) [09:09:16] (03CR) 10Volans: [C: 03+2] sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:09:44] (03CR) 10JMeybohm: [C: 04-1] modules/mesh: add capability for traffic splitting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 
(https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [09:13:50] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:14:38] (03PS3) 10Volans: sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) [09:14:43] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:15:35] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [09:17:22] (03PS1) 10KartikMistry: cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) [09:18:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [09:18:50] (03Merged) 10jenkins-bot: sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:22:19] (03CR) 10Volans: [C: 03+2] remote: add RemoteHost.get_subset() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/975211 (owner: 10Volans) [09:22:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53610 and previous config saved to /var/cache/conftool/dbconfig/20231120-092243-arnaudb.json [09:23:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53611 and previous config saved to 
/var/cache/conftool/dbconfig/20231120-092308-arnaudb.json [09:25:54] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add alert audit via puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: 10Filippo Giunchedi) [09:27:40] (03CR) 10Clément Goubert: [C: 03+2] testreduce: Reduce innodb_buffer_pool_size to 4G [puppet] - 10https://gerrit.wikimedia.org/r/973390 (owner: 10Subramanya Sastry) [09:27:56] (03CR) 10Elukey: [C: 03+2] changeprop: use safe_load_all for make_beta_config.py [deployment-charts] - 10https://gerrit.wikimedia.org/r/975738 (owner: 10Elukey) [09:29:24] (03CR) 10Arnaudb: [C: 03+2] bashrc: add a function to quick show info [puppet] - 10https://gerrit.wikimedia.org/r/975746 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:29:42] (03Merged) 10jenkins-bot: remote: add RemoteHost.get_subset() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/975211 (owner: 10Volans) [09:33:48] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: disable auth_cas when running in OIDC SSO mode [puppet] - 10https://gerrit.wikimedia.org/r/974498 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [09:34:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye [09:37:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53612 and previous config saved to /var/cache/conftool/dbconfig/20231120-093748-arnaudb.json [09:38:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53613 and previous config saved to /var/cache/conftool/dbconfig/20231120-093813-arnaudb.json [09:39:35] !log add 50G to prometheus/services in eqiad [09:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:41:18] (03PS1) 10Arnaudb: mariadb: replace db1143 with 1243 [puppet] - 10https://gerrit.wikimedia.org/r/975747 (https://phabricator.wikimedia.org/T344036) [09:41:20] !log add 50G to prometheus/k8s in codfw [09:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:51] (03PS1) 10Marostegui: db2160: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/975742 (https://phabricator.wikimedia.org/T351386) [09:42:37] (03CR) 10Marostegui: [C: 03+2] db2160: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/975742 (https://phabricator.wikimedia.org/T351386) (owner: 10Marostegui) [09:43:26] (03PS1) 10Elukey: profile::thanks: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) [09:47:49] (03CR) 10Btullis: [C: 03+1] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [09:49:25] (03PS1) 10Elukey: slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975744 (https://phabricator.wikimedia.org/T351390) [09:49:48] (03Abandoned) 10Elukey: slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975744 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [09:50:31] (03PS1) 10Elukey: slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975745 (https://phabricator.wikimedia.org/T351390) [09:50:52] !log restart swift_dispersion_stats on thanos-fe1001 [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - 
https://phabricator.wikimedia.org/T351624 (10fgiunchedi) [09:52:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53614 and previous config saved to /var/cache/conftool/dbconfig/20231120-095253-arnaudb.json [09:53:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53615 and previous config saved to /var/cache/conftool/dbconfig/20231120-095318-arnaudb.json [09:54:36] (03CR) 10Marostegui: [C: 03+1] "Don't forget to review the order of the numbers for the sections, as we discussed on IRC on Friday." [puppet] - 10https://gerrit.wikimedia.org/r/975747 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:55:10] (03CR) 10Arnaudb: [C: 03+2] mariadb: replace db1143 with 1243 [puppet] - 10https://gerrit.wikimedia.org/r/975747 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:00:03] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:18] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: provisionning db1243.eqiad.wmnet - T344036 [10:00:30] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:00:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: provisionning db1243.eqiad.wmnet - T344036 [10:00:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: provisionning db1243.eqiad.wmnet - T344036 [10:00:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: provisionning 
db1243.eqiad.wmnet - T344036 [10:02:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'T344036 add db1243', diff saved to https://phabricator.wikimedia.org/P53616 and previous config saved to /var/cache/conftool/dbconfig/20231120-100212-arnaudb.json [10:04:17] RECOVERY - Check systemd state on kubernetes1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (10ayounsi) With the addition of `L3` switches it makes sense to not only take into consideration OSPF or `L2` vlans. For unicast "regular" external... [10:05:14] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1143.eqiad.wmnet onto db1243.eqiad.wmnet [10:06:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1] kubernetes::global_config: add listener for mw on k8s transition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto) [10:06:31] (03PS3) 10Giuseppe Lavagetto: kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947 [10:07:31] (03PS1) 10Arnaudb: mariadb: repooled servers should alert [puppet] - 10https://gerrit.wikimedia.org/r/975748 (https://phabricator.wikimedia.org/T344036) [10:07:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53617 and previous config saved to /var/cache/conftool/dbconfig/20231120-100758-arnaudb.json [10:08:03] (03PS1) 10Majavah: P:toolforge: uninstall tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975768 [10:08:05] (03PS1) 10Majavah: aptrepo: drop tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975769 [10:08:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 90%: Post warmup repooling', diff saved
to https://phabricator.wikimedia.org/P53618 and previous config saved to /var/cache/conftool/dbconfig/20231120-100823-arnaudb.json [10:08:24] (03PS2) 10Elukey: profile::thanks: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) [10:08:52] (03CR) 10Ayounsi: "No strong preference but if "no_remote_confed" isn't needed let's not have it." [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: 10Cathal Mooney) [10:10:32] (03CR) 10Clément Goubert: [C: 03+1] mediawiki::packages: Clean up absented packages [puppet] - 10https://gerrit.wikimedia.org/r/975451 (owner: 10Muehlenhoff) [10:12:09] !log klausman@cumin1001 START - Cookbook sre.puppet.renew-cert for ml-serve1008.eqiad.wmnet: Renew puppet certificate - klausman@cumin1001 [10:12:24] !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for ml-serve1008.eqiad.wmnet: Renew puppet certificate - klausman@cumin1001 [10:12:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto) [10:13:04] !log klausman@cumin1001 START - Cookbook sre.puppet.renew-cert for ml-serve1008.eqiad.wmnet: Renew puppet certificate - klausman@cumin1001 [10:13:08] !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for ml-serve1008.eqiad.wmnet: Renew puppet certificate - klausman@cumin1001 [10:14:36] (03PS1) 10Filippo Giunchedi: oauth2_proxy: add blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) [10:16:09] (03CR) 10Marostegui: [C: 03+1] mariadb: repooled servers should alert [puppet] - 10https://gerrit.wikimedia.org/r/975748 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:16:28] (03CR) 10Arnaudb: [C: 03+2] mariadb: repooled servers should alert [puppet] - 
10https://gerrit.wikimedia.org/r/975748 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:16:55] (03CR) 10CI reject: [V: 04-1] oauth2_proxy: add blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [10:17:57] (03PS3) 10Elukey: profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) [10:18:22] (03PS1) 10Cyndywikime: EditGrowthConfig: Do not provide default for levelling up threshold when disabled [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975589 (https://phabricator.wikimedia.org/T351603) [10:18:32] <_joe_> jouncebot: next [10:18:32] In 0 hour(s) and 41 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1100) [10:18:38] <_joe_> jouncebot: now [10:18:38] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [10:20:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:22:11] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:22:18] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:23:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53619 and previous config saved to /var/cache/conftool/dbconfig/20231120-102303-arnaudb.json [10:23:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53620 and previous config saved to /var/cache/conftool/dbconfig/20231120-102327-arnaudb.json [10:24:20] (03PS1) 10Arnaudb: mariadb: add a new host to s6 [puppet] - 
10https://gerrit.wikimedia.org/r/975749 (https://phabricator.wikimedia.org/T343674) [10:27:37] (03CR) 10Marostegui: [C: 04-1] "Missing db2193.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/975749 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:27:56] (03PS3) 10WMDE-Fisch: Update the list of ReferenceTooltip gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) [10:33:47] (03PS5) 10WMDE-Fisch: Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) [10:35:08] (03CR) 10WMDE-Fisch: Update the list of NavigationPopups gadget names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [10:35:42] (03CR) 10WMDE-Fisch: Update the list of ReferenceTooltip gadget names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [10:35:56] (03CR) 10Vgutierrez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:36:07] (03PS1) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [10:40:12] hello Cyndywikime, welcome! 
:) [10:41:51] (03PS2) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [10:41:53] (03CR) 10Arnaudb: mariadb: add a new host to s6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975749 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:42:21] (03CR) 10Marostegui: [C: 03+1] mariadb: add a new host to s6 [puppet] - 10https://gerrit.wikimedia.org/r/975749 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [10:44:01] (PuppetFailure) resolved: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:47:38] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Update the list of ReferenceTooltip gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [10:48:04] (03CR) 10Fabfur: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:49:01] PROBLEM - MariaDB Replica Lag: s5 on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3406.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:50:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) @fgiunchedi what is the probing software? we do have a b... 
[10:50:36] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::ml_etcd [10:52:52] (03CR) 10Btullis: [C: 03+2] Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [10:52:54] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:53:17] (03CR) 10Klausman: [C: 03+2] hiera: Migrate ML etcd role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975774 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:55:14] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management records for ganeti103[5-8] - T349925 - volans@cumin1001" [10:55:19] T349925: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 [10:56:04] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management records for ganeti103[5-8] - T349925 - volans@cumin1001" [10:56:04] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:57] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:26] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::ml_etcd [10:57:29] (03CR) 10Klausman: [C: 03+1] slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975745 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [10:57:57] (03CR) 10Klausman: [C: 03+1] profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [10:58:14] (03CR) 10Klausman: [C: 03+1] 
team-ml: add site to the ORES alert's dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/975736 (https://phabricator.wikimedia.org/T346151) (owner: 10Elukey) [10:58:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10Volans) The hosts were setup in Netbox with a public VLAN and FQDN (wikimedia.org) while they should have been setup with the private one (eqiad.wmnet FQDNs). The... [10:58:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:58:46] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1009.eqiad.wmnet with OS bullseye [10:59:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [10:59:20] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1100) [11:00:08] (03CR) 10Btullis: [C: 03+1] "I have now deployed the change to prometheus, so it will scrape this job once deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:00:50] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::master [11:01:01] (03CR) 10Btullis: [C: 03+1] ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns) [11:01:47] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "I can confirm the final list in patchset 5, with one exception. It seems like 'lmowiki' => 'Popup' got lost." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [11:01:49] (03CR) 10Btullis: [C: 03+1] airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:02:10] (03Abandoned) 10Btullis: Block any open angle brackets in Archiva mirrored URLs [puppet] - 10https://gerrit.wikimedia.org/r/958930 (https://phabricator.wikimedia.org/T318962) (owner: 10Btullis) [11:03:04] (03CR) 10Btullis: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [11:03:19] (03CR) 10Klausman: [C: 03+2] hiera: Migrate ML k8s master role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975775 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [11:05:37] (03CR) 10Jgiannelos: [C: 03+1] tegola: update image to pick up OS fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/973817 (https://phabricator.wikimedia.org/T348647) (owner: 10Effie Mouzeli) [11:07:16] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::master [11:07:42] (03CR) 10Btullis: [C: 03+2] Increase the size of the innodb pool on analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:09:40] (03PS6) 10WMDE-Fisch: Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) [11:10:16] (03CR) 10WMDE-Fisch: Update the list of NavigationPopups gadget names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [11:12:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:16:30] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::worker [11:16:56] (03CR) 10Klausman: [C: 
03+2] hiera: Migrate ML k8s worker role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975776 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [11:17:12] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [11:18:34] PROBLEM - Check systemd state on dbstore1007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:01] (03PS2) 10Btullis: Remove the oozie integration from hue [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) [11:20:03] (03PS2) 10Btullis: Remove oozie configuration from core hadoop configuration files [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) [11:20:06] (03PS2) 10Btullis: Remove all remaining references to oozie and clean up [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) [11:21:22] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::worker [11:22:03] (03PS24) 10Btullis: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:22:31] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host to s6 [puppet] - 10https://gerrit.wikimedia.org/r/975749 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [11:22:51] (03CR) 10Volans: [C: 03+1] "LGTM, this should be merged (and then released) before the puppet change, as it's backward compatible" [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:23:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 14): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/569/co" [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [11:23:43] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [11:25:01] (03CR) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:25:22] (03PS3) 10Giuseppe Lavagetto: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990 [11:25:24] (03PS3) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) [11:26:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [11:27:44] (03CR) 10Btullis: [C: 03+2] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:28:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [11:29:49] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but please ensure the corresponding spicerack patch makes it to production ASAP as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:31:56] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5810.33 seconds Marostegui expired downtime https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:32:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1210', diff saved to https://phabricator.wikimedia.org/P53622 and previous config saved to /var/cache/conftool/dbconfig/20231120-113205-root.json [11:32:56] (03PS3) 10Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) [11:35:40] (03CR) 10CI reject: [V: 04-1] modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:35:52] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/570/console" [puppet] - 10https://gerrit.wikimedia.org/r/975778 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [11:36:04] (03PS2) 10Klausman: hiera: Cleanup leftovers of Puppet v7 migration for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/975778 (https://phabricator.wikimedia.org/T349619) [11:39:16] (03CR) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:40:56] (03PS4) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) [11:41:45] (03CR) 10Vgutierrez: [C: 03+2] service: Add ipip_encapsulation field 
to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:47:12] (03PS2) 10Filippo Giunchedi: oauth2_proxy: add blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) [11:48:33] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1009.eqiad.wmnet with OS bullseye [11:48:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10fgiunchedi) The software in this case is prometheus blackbox ex... [11:55:38] RECOVERY - MariaDB Replica Lag: s5 on db1210 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:55:49] (03CR) 10JMeybohm: [C: 03+1] modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:56:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53623 and previous config saved to /var/cache/conftool/dbconfig/20231120-115635-root.json [11:56:56] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:54] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 13): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/574/console" [puppet] - 10https://gerrit.wikimedia.org/r/975778 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [12:01:06] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Cleanup leftovers of Puppet v7 migration for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/975778 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [12:04:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: provisionning db2193.codfw.wmnet - T343674 [12:04:56] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [12:05:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: provisionning db2193.codfw.wmnet - T343674 [12:05:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: provisionning db2193.codfw.wmnet - T343674 [12:05:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: provisionning db2193.codfw.wmnet - T343674 [12:07:21] (03PS3) 10Saint Johann: Enable action blocks in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973795 (https://phabricator.wikimedia.org/T351048) [12:07:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2180 in db2193 for T343674', diff saved to https://phabricator.wikimedia.org/P53624 and previous config saved to /var/cache/conftool/dbconfig/20231120-120743-arnaudb.json [12:09:31] (03CR) 10Btullis: "Looks good in general. Small query about variable names." 
[puppet] - 10https://gerrit.wikimedia.org/r/975291 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [12:09:59] (03PS1) 10Jbond: pcc:clean_reports: [puppet] - 10https://gerrit.wikimedia.org/r/975781 (https://phabricator.wikimedia.org/T336350) [12:10:06] (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/975780 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [12:10:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/575/console" [puppet] - 10https://gerrit.wikimedia.org/r/975781 (https://phabricator.wikimedia.org/T336350) (owner: 10Jbond) [12:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53625 and previous config saved to /var/cache/conftool/dbconfig/20231120-121140-root.json [12:13:05] (03PS1) 10Arnaudb: mariadb: fix mariadb::shard missing error [puppet] - 10https://gerrit.wikimedia.org/r/975751 (https://phabricator.wikimedia.org/T343674) [12:13:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:13:43] (03CR) 10Marostegui: [C: 03+1] mariadb: fix mariadb::shard missing error [puppet] - 10https://gerrit.wikimedia.org/r/975751 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:15:48] RECOVERY - Check systemd state on dbstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:10] (03PS2) 10Jbond: pcc:clean_reports: [puppet] - 10https://gerrit.wikimedia.org/r/975781 (https://phabricator.wikimedia.org/T336350) [12:17:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/576/con" [puppet] - 10https://gerrit.wikimedia.org/r/975781 (https://phabricator.wikimedia.org/T336350) (owner: 10Jbond) [12:17:06] (03CR) 10Arnaudb: [C: 03+2] mariadb: fix mariadb::shard missing error [puppet] - 10https://gerrit.wikimedia.org/r/975751 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:17:46] PROBLEM - snapshot of s1 in codfw on backupmon1001 is CRITICAL: snapshot for s1 at codfw (db2141) taken more than 3 days ago: Most recent backup 2023-11-17 12:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:18:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] pcc:clean_reports: [puppet] - 10https://gerrit.wikimedia.org/r/975781 (https://phabricator.wikimedia.org/T336350) (owner: 10Jbond) [12:20:54] (03CR) 10Majavah: [C: 03+2] Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:22:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1143.eqiad.wmnet onto db1243.eqiad.wmnet [12:22:41] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2180.codfw.wmnet onto db2193.codfw.wmnet [12:26:03] (03CR) 10Btullis: [C: 03+1] "FYI: these two servers may well become obsolete soon, as the wikireplicas are planned to move behind the new cloudlb tier in this ticket." 
[puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [12:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53627 and previous config saved to /var/cache/conftool/dbconfig/20231120-122645-root.json [12:27:22] (03PS1) 10Jbond: base: switch rsyslog tls_netstream_driver to ossl [puppet] - 10https://gerrit.wikimedia.org/r/975791 (https://phabricator.wikimedia.org/T324623) [12:28:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975791 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [12:29:13] (03CR) 10Btullis: [C: 03+1] "This looks good to me. Should we add the required rights manually before merging?" [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:31:05] (03CR) 10Btullis: "This looks good to me in principle. I see that you've noted a TODO about getting the right IP addresses, so I won't vote until that is don" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:32:49] (03PS1) 10Arnaudb: mariadb: add a new host on s5 [puppet] - 10https://gerrit.wikimedia.org/r/975752 (https://phabricator.wikimedia.org/T343674) [12:33:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) >>! In T351624#9344407, @fgiunchedi wrote: > The softwar... 
[12:34:14] (03PS1) 10Majavah: wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 [12:36:01] (03CR) 10Marostegui: mariadb: add a new host on s5 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975752 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:36:56] 10SRE, 10ops-eqiad, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10taavi) [12:37:02] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10jbond) Reading the task it seems like the last blocker was to "wait out buster" (T324623#8449852). however as we have now deployed this to buster (T32462... [12:37:24] (03CR) 10D3r1ck01: Set new $wgMicroStashType setting to "mcrouter-primary-dc" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [12:37:44] (03PS2) 10Arnaudb: mariadb: add a new host on s5 [puppet] - 10https://gerrit.wikimedia.org/r/975752 (https://phabricator.wikimedia.org/T343674) [12:38:04] (03CR) 10CI reject: [V: 04-1] wikireplicas: update-views: try to do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 (owner: 10Majavah) [12:38:22] (03CR) 10Jbond: [C: 03+2] realm.pp: drop use_puppetdb global [puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:39:29] (03CR) 10Marostegui: [C: 03+1] mariadb: add a new host on s5 [puppet] - 10https://gerrit.wikimedia.org/r/975752 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:39:54] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host on s5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975752 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:41:11] (03PS2) 10Majavah: wikireplicas: update-views: try to 
do changes live [cookbooks] - 10https://gerrit.wikimedia.org/r/975796 [12:41:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53628 and previous config saved to /var/cache/conftool/dbconfig/20231120-124150-root.json [12:43:20] (03PS4) 10Jbond: realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008) [12:43:22] (03PS4) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [12:43:24] (03PS4) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [12:43:26] (03PS4) 10Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:43:28] (03PS4) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:43:30] (03PS4) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:43:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: provisionning db2192.codfw.wmnet - T343674 [12:43:55] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [12:44:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: provisionning db2192.codfw.wmnet - T343674 [12:44:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: provisionning db2192.codfw.wmnet - T343674 [12:44:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: provisionning db2192.codfw.wmnet - T343674 [12:45:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db2178 in db2192 for T343674', diff saved to https://phabricator.wikimedia.org/P53629 and previous config saved to /var/cache/conftool/dbconfig/20231120-124522-arnaudb.json [12:46:13] (03CR) 10Jbond: [C: 03+2] realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:48:14] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:48:45] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db2178.codfw.wmnet onto db2192.codfw.wmnet [12:49:57] (03PS3) 10Filippo Giunchedi: oauth2_proxy: add blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) [12:51:14] just checking: it’s okay to deploy stuff today, right? thanksgiving isn’t until thursday? [12:52:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "let's try it!" [puppet] - 10https://gerrit.wikimedia.org/r/975791 (https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [12:54:50] (03CR) 10Elukey: [C: 03+1] Clean up additional ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/975780 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [12:55:25] (03CR) 10Marostegui: "Can you do a PCC for db1154, db1155, db2186, db2187?" 
[puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:55:44] PROBLEM - Check systemd state on kubernetes2013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:26] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [12:56:35] (03CR) 10Filippo Giunchedi: [C: 03+1] slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975745 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [12:56:51] (03PS1) 10Marostegui: db2133,db1217: Remove package declaracion [puppet] - 10https://gerrit.wikimedia.org/r/975798 (https://phabricator.wikimedia.org/T351386) [12:56:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53630 and previous config saved to /var/cache/conftool/dbconfig/20231120-125655-root.json [12:57:02] (03PS1) 10Cathal Mooney: Reset spine switch BGP to CR if max prefix tripped after 30 mins [homer/public] - 10https://gerrit.wikimedia.org/r/975799 (https://phabricator.wikimedia.org/T349116) [12:57:14] (03CR) 10Elukey: [C: 03+2] profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975743 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [12:57:43] (03CR) 10Marostegui: [C: 03+2] db2133,db1217: Remove package declaracion [puppet] - 10https://gerrit.wikimedia.org/r/975798 (https://phabricator.wikimedia.org/T351386) (owner: 10Marostegui) [12:57:54] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_definitions: remove one regex match for Lift Wing services [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/975745 
(https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [12:58:52] (03PS5) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [12:58:54] (03PS5) 10Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:58:56] (03PS5) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:58:58] (03PS5) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:59:43] (03CR) 10Jbond: realm.pp: drop wikimail_smarthost global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:02:41] (03CR) 10Ilias Sarantopoulos: [C: 03+1] team-ml: add site to the ORES alert's dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/975736 (https://phabricator.wikimedia.org/T346151) (owner: 10Elukey) [13:03:06] (03CR) 10Elukey: [C: 03+2] team-ml: add site to the ORES alert's dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/975736 (https://phabricator.wikimedia.org/T346151) (owner: 10Elukey) [13:04:21] (03Merged) 10jenkins-bot: team-ml: add site to the ORES alert's dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/975736 (https://phabricator.wikimedia.org/T346151) (owner: 10Elukey) [13:06:08] (03PS10) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) [13:06:55] (03CR) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:07:17] PROBLEM - Backup freshness on backup1001 
is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:09:59] (03CR) 10Klausman: [C: 03+1] team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:10:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/584/console" [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [13:10:52] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] oauth2_proxy: add blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/975770 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [13:22:19] (03PS1) 10Ladsgroup: Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975806 (https://phabricator.wikimedia.org/T351237) [13:24:20] (03PS1) 10Elukey: profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975808 (https://phabricator.wikimedia.org/T351390) [13:25:42] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host pc1014.eqiad.wmnet [13:27:48] (03PS1) 10Jbond: pc1014: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/975809 (https://phabricator.wikimedia.org/T349619) [13:29:32] (03PS5) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [13:29:53] (03CR) 10Jbond: [C: 03+2] pc1014: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/975809 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:30:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2180.codfw.wmnet onto db2193.codfw.wmnet 
[13:33:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53631 and previous config saved to /var/cache/conftool/dbconfig/20231120-133316-arnaudb.json [13:33:21] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:33:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host pc1014.eqiad.wmnet [13:35:12] (03PS1) 10Filippo Giunchedi: profile: adjust oidc probe name [puppet] - 10https://gerrit.wikimedia.org/r/975812 (https://phabricator.wikimedia.org/T331512) [13:35:27] (03CR) 10CI reject: [V: 04-1] profile: adjust oidc probe name [puppet] - 10https://gerrit.wikimedia.org/r/975812 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [13:35:36] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975808 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:36:24] (03CR) 10Elukey: [C: 03+2] profile::thanos: improve Istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/975808 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:37:35] (03PS2) 10Filippo Giunchedi: profile: adjust oidc probe name [puppet] - 10https://gerrit.wikimedia.org/r/975812 (https://phabricator.wikimedia.org/T331512) [13:37:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2178.codfw.wmnet onto db2192.codfw.wmnet [13:40:17] (03PS1) 10Lucas Werkmeister (WMDE): Add update.php maintenance script to fix pp_sortkey [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975593 (https://phabricator.wikimedia.org/T350224) [13:40:49] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Warning re: excessive directory entries on prometheus with puppet7 - 
https://phabricator.wikimedia.org/T351643 (10fgiunchedi) [13:41:04] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Warning re: excessive directory entries on prometheus with puppet7 - https://phabricator.wikimedia.org/T351643 (10fgiunchedi) p:05Triage→03Low [13:42:02] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: adjust oidc probe name [puppet] - 10https://gerrit.wikimedia.org/r/975812 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [13:42:33] (03CR) 10Jbond: [C: 04-1] "this is used in profile::mail::default_mail_relay" [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:43:34] PROBLEM - Check systemd state on ganeti2012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/586/console" [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:46:48] (03CR) 10Jgiannelos: [C: 03+1] wikifeeds: add rest-gateway config and bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) (owner: 10Hnowlan) [13:48:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P53632 and previous config saved to /var/cache/conftool/dbconfig/20231120-134822-arnaudb.json [13:49:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Warning re: excessive directory entries on prometheus with puppet7 - https://phabricator.wikimedia.org/T351643 (10Volans) I think that the problem is that the directory is defined in puppet with recurse=true in `modules/prometheus/m... 
[13:49:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990 (owner: 10Giuseppe Lavagetto) [13:50:45] (03Merged) 10jenkins-bot: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990 (owner: 10Giuseppe Lavagetto) [13:52:14] RECOVERY - Check systemd state on kubernetes2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [13:58:41] (03PS1) 10Giuseppe Lavagetto: mobileapps: upgrade mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/975815 [13:58:43] (03PS1) 10Giuseppe Lavagetto: mobileapps: switch to use the traffic percentage split endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/975816 (https://phabricator.wikimedia.org/T350846) [13:59:46] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Warning re: excessive directory entries on prometheus with puppet7 - https://phabricator.wikimedia.org/T351643 (10fgiunchedi) I don't remember the context for the change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/830640)... [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1400). [14:00:05] Cyndywikime, WMDE-Fisch, xSavitar, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:12] I can deploy today [14:00:55] urbanecm, thanks o/ [14:01:44] Lucas_WMDE: Cyndywikime: hello, around for the window as well? :) [14:02:01] yes i am . thanks Martin! [14:02:08] great! [14:02:09] ;) [14:02:17] :D [14:02:43] (03Merged) 10jenkins-bot: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [14:02:47] (03CR) 10Urbanecm: [C: 03+2] EditGrowthConfig: Do not provide default for levelling up threshold when disabled [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975589 (https://phabricator.wikimedia.org/T351603) (owner: 10Cyndywikime) [14:03:20] (03PS4) 10Urbanecm: Set new $wgMicroStashType setting to "mcrouter-primary-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [14:03:24] (03CR) 10Urbanecm: [C: 03+2] Set new $wgMicroStashType setting to "mcrouter-primary-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [14:03:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P53633 and previous config saved to /var/cache/conftool/dbconfig/20231120-140329-arnaudb.json [14:04:14] oops, I was distracted [14:04:16] I’m around [14:04:25] urbanecm: I take it you’re deploying? 
[14:04:28] welcome to the window Lucas_WMDE :) [14:04:29] yup [14:04:31] ok :) [14:04:33] (03PS1) 10Filippo Giunchedi: Remove beta-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/975819 (https://phabricator.wikimedia.org/T344974) [14:04:40] (03CR) 10Urbanecm: [C: 03+2] Add update.php maintenance script to fix pp_sortkey [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975593 (https://phabricator.wikimedia.org/T350224) (owner: 10Lucas Werkmeister (WMDE)) [14:04:46] I can do my backport at the end if there’s time left, otherwise I’ll do it another time [14:04:47] ok, or that ^^ [14:04:49] Lucas_WMDE: looks like your script needs no testing though. [14:04:56] yeah, I’ll just run it later [14:05:00] sounds good [14:05:02] (03Merged) 10jenkins-bot: Set new $wgMicroStashType setting to "mcrouter-primary-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [14:05:13] thanks! [14:05:23] i'll just ship it with something then :) [14:05:39] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]] [14:05:40] 🚢 [14:05:45] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004 [14:05:47] urbanecm, the new config setting also doesn't need any testing (for now). There is a core patch that will begin using it soon. [14:05:54] xSavitar: ack, noted, thank you. [14:06:28] thank you [14:06:37] seems we're still missing WMDE-Fisch? 
[14:06:57] !log urbanecm@deploy2002 urbanecm and d3r1ck01: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:01] !log urbanecm@deploy2002 urbanecm and d3r1ck01: Continuing with sync [14:07:03] xSavitar: syncing yours [14:08:30] urbanecm: I’ve pinged him [14:09:14] thanks Lucas_WMDE [14:09:18] (03CR) 10Majavah: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/975819 (https://phabricator.wikimedia.org/T344974) (owner: 10Filippo Giunchedi) [14:09:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) As you might have noticed by the patches here, we've pivoted as traffic splitting to the canaries via kube-proxy converges over hours, not seconds which is what we'll nee... [14:11:14] welcome WMDE-Fisch [14:11:29] o/ [14:11:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'set es2028 as es1 master for T344589', diff saved to https://phabricator.wikimedia.org/P53634 and previous config saved to /var/cache/conftool/dbconfig/20231120-141131-arnaudb.json [14:11:54] (03PS6) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [14:11:56] (03PS6) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [14:11:58] (03PS6) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [14:12:00] (03PS6) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [14:12:02] (03PS6) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 
(https://phabricator.wikimedia.org/T350008) [14:12:04] (03PS1) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [14:12:06] (03PS1) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [14:12:08] (03PS1) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [14:12:11] WMDE-Fisch: just double checking, both your changes can go out together, right? [14:12:19] yes [14:12:21] ack [14:12:26] (03PS4) 10Urbanecm: Update the list of ReferenceTooltip gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:12:29] (03CR) 10Urbanecm: [C: 03+2] Update the list of ReferenceTooltip gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:12:33] (03PS7) 10Urbanecm: Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:12:36] (03CR) 10Urbanecm: [C: 03+2] Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:12:38] (03CR) 10Ottomata: Export the replication factor of kafka topics as a prometheus metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975291 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol) [14:12:45] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974506|Set new $wgMicroStashType setting to "mcrouter-primary-dc" (T336004)]] (duration: 07m 06s) [14:12:56] T336004: Recognize 4th cache 
service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004 [14:13:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:13:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:13:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'prepare reboot of es2032 for T344589', diff saved to https://phabricator.wikimedia.org/P53635 and previous config saved to /var/cache/conftool/dbconfig/20231120-141312-arnaudb.json [14:14:55] (03PS1) 10Bking: cloudelastic: force Puppet 7 for cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/975824 (https://phabricator.wikimedia.org/T351354) [14:15:01] (03Merged) 10jenkins-bot: Update the list of ReferenceTooltip gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:15:08] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:15:22] finally. 
quite some time for config change CI :) [14:15:35] (03PS8) 10Urbanecm: Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:15:42] (03CR) 10Urbanecm: [C: 03+2] Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:15:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:16:26] (03Merged) 10jenkins-bot: Update the list of NavigationPopups gadget names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [14:16:38] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974984|Update the list of ReferenceTooltip gadget names (T351314)]], [[gerrit:975021|Update the list of NavigationPopups gadget names (T351314)]] [14:16:47] T351314: Update the list of ReferenceTooltips and NavigationPopups gadgets - https://phabricator.wikimedia.org/T351314 [14:16:47] (03PS7) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [14:16:49] (03PS7) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [14:16:51] (03PS2) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [14:16:53] (03PS7) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [14:16:55] 
(03PS2) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [14:16:58] (03PS2) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [14:17:00] (03PS7) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [14:17:01] (03PS7) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [14:17:50] (03PS8) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [14:17:51] (03PS8) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [14:17:53] (03PS3) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [14:17:54] !log urbanecm@deploy2002 urbanecm and wmde-fisch: Backport for [[gerrit:974984|Update the list of ReferenceTooltip gadget names (T351314)]], [[gerrit:975021|Update the list of NavigationPopups gadget names (T351314)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:56] (03PS8) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [14:17:58] (03PS3) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [14:18:00] (03PS3) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 
10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [14:18:02] (03PS8) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [14:18:04] (03PS8) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [14:18:05] WMDE-Fisch: your patches are at the debug server, can you test please? [14:18:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: upgrade mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/975815 (owner: 10Giuseppe Lavagetto) [14:18:15] urbanecm: Doing ... [14:18:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53636 and previous config saved to /var/cache/conftool/dbconfig/20231120-141835-arnaudb.json [14:18:37] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:18:40] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:18:51] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:18:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53637 and previous config saved to /var/cache/conftool/dbconfig/20231120-141857-arnaudb.json [14:19:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975824 (https://phabricator.wikimedia.org/T351354) (owner: 10Bking) [14:19:40] (03CR) 10Bking: [C: 03+2] cloudelastic: force Puppet 7 for cloudelastic1010 [puppet] - 10https://gerrit.wikimedia.org/r/975824 (https://phabricator.wikimedia.org/T351354) (owner: 10Bking) [14:20:41] urbanecm: Works, thanks! 
[14:20:45] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove beta-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/975819 (https://phabricator.wikimedia.org/T344974) (owner: 10Filippo Giunchedi) [14:20:46] great, syncing! [14:20:48] !log urbanecm@deploy2002 urbanecm and wmde-fisch: Continuing with sync [14:21:47] (03PS1) 10Cathal Mooney: Remove cloud hosts except clouddb from the "no IPv6 hostname" list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) [14:22:02] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:22:05] (03Merged) 10jenkins-bot: mobileapps: upgrade mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/975815 (owner: 10Giuseppe Lavagetto) [14:22:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [14:22:38] (03Merged) 10jenkins-bot: EditGrowthConfig: Do not provide default for levelling up threshold when disabled [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975589 (https://phabricator.wikimedia.org/T351603) (owner: 10Cyndywikime) [14:22:41] (03Merged) 10jenkins-bot: Add update.php maintenance script to fix pp_sortkey [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975593 (https://phabricator.wikimedia.org/T350224) (owner: 10Lucas Werkmeister (WMDE)) [14:22:45] finally, backports merged :) [14:23:42] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:24:20] \o/ [14:24:43] (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 
(https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:24:55] (03CR) 10CI reject: [V: 04-1] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:25:04] urbanecm, thanks! [14:25:10] np [14:26:26] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974984|Update the list of ReferenceTooltip gadget names (T351314)]], [[gerrit:975021|Update the list of NavigationPopups gadget names (T351314)]] (duration: 09m 48s) [14:26:30] T351314: Update the list of ReferenceTooltips and NavigationPopups gadgets - https://phabricator.wikimedia.org/T351314 [14:26:46] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]] [14:26:50] Cyndywikime: Lucas_WMDE: working on your patches now :) [14:26:51] T351603: When Levelling up features are disabled, browsing Special:EditGrowthConfig errors out - https://phabricator.wikimedia.org/T351603 [14:26:52] T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224 [14:28:06] !log urbanecm@deploy2002 urbanecm and lucaswerkmeister-wmde and cyndywikime: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:16] Cyndywikime: your patch is now available at mwdebug2001. Can you test it, please? [14:31:23] patch resolves the error. [14:31:28] great, proceeding [14:31:32] !log urbanecm@deploy2002 urbanecm and lucaswerkmeister-wmde and cyndywikime: Continuing with sync [14:31:54] (03CR) 10Anzx: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975376 (https://phabricator.wikimedia.org/T350373) (owner: 10Anzx) [14:36:40] 10SRE, 10procurement: Investigation/quote for additional SSDs on Prometheus hosts - https://phabricator.wikimedia.org/T351645 (10fgiunchedi) [14:37:13] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]] (duration: 10m 28s) [14:37:19] T351603: When Levelling up features are disabled, browsing Special:EditGrowthConfig errors out - https://phabricator.wikimedia.org/T351603 [14:37:20] T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224 [14:37:23] Cyndywikime: Lucas_WMDE: your patches are deployed now :) [14:37:49] \o/ [14:37:50] thanks! [14:37:53] np :) [14:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:31] (03CR) 10Elukey: [C: 03+2] tox.ini: whitelist_externals -> allowlist_externals [software] - 10https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [14:38:37] (03CR) 10CI reject: [V: 04-1] tox.ini: whitelist_externals -> allowlist_externals [software] - 10https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [14:38:47] (03PS2) 10Elukey: tox.ini: whitelist_externals -> allowlist_externals [software] - 10https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [14:39:13] !log UTC afternoon B&C window done [14:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:39:54] (03CR) 10Majavah: [C: 03+1] Remove cloud hosts except clouddb from the "no IPv6 hostname" list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [14:41:22] RECOVERY - Check systemd state on ganeti2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [14:51:21] (03Abandoned) 10Hashar: Make Scap directories on deployment servers compatible with CVE-2022-24756 fix [puppet] - 10https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) (owner: 10Muehlenhoff) [14:52:40] (03PS9) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [14:52:42] (03PS9) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [14:52:44] (03PS4) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [14:52:46] (03PS9) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [14:52:48] (03PS4) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [14:52:50] (03PS4) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [14:52:52] (03PS9) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) 
[14:52:54] (03PS9) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [14:53:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:35] (03CR) 10Volans: "LGTM, one small typo and two totally optional comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [14:55:58] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:56:46] (03CR) 10Volans: [C: 03+1] "Thanks for updating the list" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [14:57:21] (03CR) 10Ayounsi: [C: 03+1] Remove cloud hosts except clouddb from the "no IPv6 hostname" list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [14:58:20] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:58:26] (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:58:58] (03CR) 10CI reject: [V: 04-1] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:05:34] (CirrusSearchJobQueueLagTooHigh) firing: 
CirrusSearch job cirrusSearchElasticaWrite lag is too high: 9h 4m 14s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [15:06:31] (03PS1) 10Filippo Giunchedi: team-o11y: alert on Prometheus storing a few days of data [alerts] - 10https://gerrit.wikimedia.org/r/975832 (https://phabricator.wikimedia.org/T351179) [15:08:16] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [15:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:11:18] (03CR) 10Hashar: httpd: ErrorLogFormat for ECS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [15:11:33] (03PS3) 10Hashar: httpd: ErrorLogFormat for ECS [puppet] - 10https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) [15:11:36] (03CR) 10Ilias Sarantopoulos: [C: 03+2] team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:12:28] PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.timer Failed on elastic1102:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:48] PROBLEM - Check systemd state on mw2423 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:50] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:50] PROBLEM - Check systemd state on dns3004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:02] PROBLEM - Check systemd state on cp4042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:06] PROBLEM - Check systemd state on mw1474 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:10] PROBLEM - Check systemd state on kubernetes1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:11] hmm [15:13:28] PROBLEM - Check systemd state on parse1023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:28] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:29] 
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/07bd2eeb0378af08937823ae53fcd056e1d68786 [15:13:40] jbond: ^ this might be it? [15:13:45] ugh that's me I think, sorry about that [15:13:48] PROBLEM - Check systemd state on parse1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:50] oh you merged that, right! sorry [15:13:52] PROBLEM - Check systemd state on mw1451 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:52] PROBLEM - Check systemd state on kubernetes1031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:52] PROBLEM - Check systemd state on logstash1034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:52] godog: thanks [15:13:58] investigating [15:14:04] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:06] PROBLEM - Check systemd state on mw2427 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:06] PROBLEM - Check systemd state on db2134 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:08] PROBLEM - Check systemd state on cp5031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:10] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:14] PROBLEM - Check systemd state on db1182 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:18] PROBLEM - Check systemd state on cp5025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:26] PROBLEM - Check systemd state on mw2313 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:26] PROBLEM - Check systemd state on moss-be2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:30] PROBLEM - Check systemd state on ganeti1027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:30] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:32] (03Merged) 10jenkins-bot: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:14:34] PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:46] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:46] PROBLEM - Check systemd state on elastic1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:48] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:48] PROBLEM - Check systemd state on db2112 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:50] PROBLEM - Check systemd state on parse1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:50] PROBLEM - Check systemd state on kubernetes2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:52] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:56] PROBLEM - Check systemd state on mw2358 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:00] PROBLEM - Check systemd state on db2160 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:00] PROBLEM - Check systemd state on mw2328 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:00] PROBLEM - Check systemd state on ncredir4002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:00] PROBLEM - Check systemd state on cp6008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:02] PROBLEM - Check systemd state on ncredir6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:04] PROBLEM - Check systemd state on db1212 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:04] so timer exists but service not found? 
[15:15:06] PROBLEM - Check systemd state on parse2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:06] PROBLEM - Check systemd state on kubestage1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:07] sorry for the spam [15:15:10] PROBLEM - Check systemd state on cp5030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:12] PROBLEM - Check systemd state on mc1054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:14] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:18] PROBLEM - Check systemd state on cp3067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:22] PROBLEM - Check systemd state on mw1454 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:22] PROBLEM - Check systemd state on mw1471 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:26] PROBLEM - Check systemd state on dns6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:15:32] PROBLEM - Check systemd state on cp5027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:34] PROBLEM - Check systemd state on lvs2013 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:36] PROBLEM - Check systemd state on mw2279 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:38] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:38] PROBLEM - Check systemd state on an-worker1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:41] I'm stopping icinga-wm [15:15:47] (03PS1) 10Ssingh: Revert "prometheus-puppet-agent-stats: this timer sometime fails" [puppet] - 10https://gerrit.wikimedia.org/r/975595 [15:15:50] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:52] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:56] PROBLEM - Check systemd state on parse2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:57] I have a revert ready if that's helpful for now 
[15:16:00] PROBLEM - Check systemd state on bast2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:08] PROBLEM - Check systemd state on mw2309 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:10] PROBLEM - Check systemd state on cp5017 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:14] PROBLEM - Check systemd state on ncredir5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:18] sukhe: thank you, I think nothing bad is happening, just the spam [15:16:22] PROBLEM - Check systemd state on cp2029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:23] yp [15:16:28] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:28] PROBLEM - Check systemd state on db2133 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:31] godog: in a meeting but let me know if you need a hand [15:16:51] thank you jbond, all good [15:16:56] cool [15:17:07] the cleanup for systemd timer should reset-failed and it doesn't [15:17:38] ahh ack, although that could hide other issues so ... 
[15:17:42] (SystemdUnitFailed) resolved: (2) prometheus_puppet_agent_stats.timer Failed on cloudelastic1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:48] yeah... [15:18:03] not really, you can reset-failed individual units [15:18:23] (03PS1) 10Elukey: profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) [15:18:41] godog: ah ok then yes sgtm [15:18:57] I meant icinga-wm in this case :P [15:20:28] heheh fair enough [15:20:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 25m 36s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [15:22:13] ok I'll let puppet do its thing and then do a cleanup via cumin [15:22:28] godog: ok! hth if required, please let me know (on on-call) [15:22:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.timer Failed on elastic1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:49] thank you sukhe ! 
appreciate it [15:23:47] (03PS2) 10Klausman: Clean up additional ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/975780 (https://phabricator.wikimedia.org/T347278) [15:27:42] (SystemdUnitFailed) resolved: (3) prometheus_puppet_agent_stats.timer Failed on cloudelastic1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:44] sukhe: I'm going to abandon your revert [15:27:52] godog: sure! thanks [15:27:57] (03Abandoned) 10Filippo Giunchedi: Revert "prometheus-puppet-agent-stats: this timer sometime fails" [puppet] - 10https://gerrit.wikimedia.org/r/975595 (owner: 10Ssingh) [15:28:26] (03CR) 10Klausman: [C: 03+2] Clean up additional ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/975780 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [15:28:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.timer Failed on elastic1093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:22] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:42] (SystemdUnitFailed) resolved: (3) prometheus_puppet_agent_stats.timer Failed on cloudelastic1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:03] (03PS11) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) 
[15:35:07] (03PS1) 10Jdlrobson: Filter translation service errors [puppet] - 10https://gerrit.wikimedia.org/r/975836 [15:35:12] (SystemdUnitFailed) firing: (2) prometheus_puppet_agent_stats.timer Failed on elastic1093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:27] (SystemdUnitFailed) resolved: (2) prometheus_puppet_agent_stats.timer Failed on elastic1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:32] (03CR) 10Jbond: "updated thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [15:36:11] (03CR) 10Jdlrobson: Filter translation service errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975836 (owner: 10Jdlrobson) [15:36:41] also in hindsight the approach I suggested only covers puppet invocations via the timer, not via 'run-puppet-agent' to update the prometheus stats [15:37:15] which will be updated at the next automated puppet run anyways, but still [15:40:12] (SystemdUnitFailed) firing: (5) prometheus_puppet_agent_stats.timer Failed on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:30] (03PS12) 10Jbond: puppet: update get_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [15:42:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [15:42:47] all calm at the moment? 
I’d like to run a brief maintenance script on testwikidatawiki [15:42:54] (03PS1) 10Ssingh: P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) [15:42:54] (I’ll go ahead if nobody objects in a few minutes :)) [15:43:35] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1110.eqiad.wmnet [15:43:36] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1110.eqiad.wmnet [15:44:06] fabfur@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:44:33] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:44:41] !log swapped cp1110 <-> cp1085 (T349244) [15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:45] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:45:12] (SystemdUnitFailed) resolved: (4) prometheus_puppet_agent_stats.timer Failed on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:54] ok icinga can come back [15:46:59] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) 05In progress→03Resolved [15:47:00] Lucas_WMDE: yes +1 [15:47:24] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1111.eqiad.wmnet [15:47:25] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1111.eqiad.wmnet [15:48:18] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript 
Wikibase.Lexeme.Maintenance.FixPagePropsSortkey testwikidatawiki --batch-size=1000 # T350224 [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:24] T350224: [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes - https://phabricator.wikimedia.org/T350224 [15:48:24] !log swapped cp1111 <-> cp1086 (T349244) [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:39] !log DONE Wikibase.Lexeme.Maintenance.FixPagePropsSortkey (T350224) in 1.079s real time :) [15:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:50:05] jbond: in light of what I realized (what I suggested with the timer dependency doesn't cover manual run-puppet-agent runs) I'm happy to go back to your original and much simpler solution, what do you think ? 
[15:50:31] (03CR) 10Ssingh: [V: 03+1] "This blocks on https://gerrit.wikimedia.org/r/c/operations/puppet/+/975009 but that's fine, we are just preparing the patches based on the" [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:51:08] or add a systemctl start prometheus_puppet_agent_stats to run-puppet-agent FWIW [15:52:04] * Lucas_WMDE done btw [15:52:16] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10joanna_borun) a:05jbond→03cmooney [15:52:48] 10SRE-tools, 10Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (10joanna_borun) a:05jbond→03Volans [15:54:23] 10SRE-tools, 10Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (10jbond) 05Open→03Declined not enough information [15:55:12] (03PS2) 10Elukey: profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) [15:55:14] (03PS1) 10Elukey: profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) [15:55:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (10joanna_borun) a:05jbond→03Volans [15:56:39] (03PS2) 10Elukey: profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) [15:59:40] 
PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:00:03] I'll investigate that ^ [16:00:11] godog: thanks! [16:00:14] (03CR) 10CI reject: [V: 04-1] puppet: update get_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:01:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10observability, 10User-jbond: Add monitoring for the puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10joanna_borun) a:05jbond→03None [16:04:08] (03CR) 10Klausman: [C: 03+1] profile::pyrra::filesystem: new Lift Wing pilot candidate [puppet] - 10https://gerrit.wikimedia.org/r/975833 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:04:35] (03CR) 10Klausman: [C: 03+1] profile::thanos: change increase() range for Lift Wing [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:06:51] (03CR) 10Mvolz: rest-gateway: add params to config, rework citoid path matching (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [16:07:19] (03CR) 10Elukey: "As example, in Grizzly we do this for the latency SLO:" [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:08:45] (03CR) 10Elukey: profile::thanos: change increase() range for Lift Wing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975846 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:09:31] 10SRE, 10CAS-SSO, 10Gerrit, 10Infrastructure-Foundations, and 3 others: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10jbond) a:05jbond→03None [16:09:40] RECOVERY - Check correctness of the icinga configuration on alert1001 
is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [16:11:51] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) 05Open→03Resolved [16:12:42] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: Allow idrac ftp fetching of firmware updates (either to existing ftp or new solution) - https://phabricator.wikimedia.org/T283771 (10jbond) 05Open→03Resolved a:05jbond→03None @RobH closing this as we now have the upgrade-firmware cookbook... [16:13:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:18:20] (03PS1) 10BCornwall: acme-chief: Remove acmechief2001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975853 (https://phabricator.wikimedia.org/T342154) [16:19:28] (03PS6) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) [16:21:39] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [16:22:07] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) [16:22:18] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) p:05Triage→03High [16:22:31] (03PS1) 10C. 
Scott Ananian: [parsoid] Fix Parsoid relative links [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) [16:22:50] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) (priority set to high as we do use the swift-dispersion-stats to check for cluster health) [16:24:03] (03CR) 10Subramanya Sastry: [C: 03+1] [parsoid] Fix Parsoid relative links [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: 10C. Scott Ananian) [16:25:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10LSobanski) Removing #collaboration-services as I don't see any... 
[16:25:05] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) [16:25:54] (03CR) 10Volans: puppet: update get_ca_server to also support srv discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:26:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1210', diff saved to https://phabricator.wikimedia.org/P53638 and previous config saved to /var/cache/conftool/dbconfig/20231120-162648-root.json [16:27:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 8793.10 seconds Marostegui testing https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:28:44] (03PS1) 10Marostegui: db1210: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/975855 (https://phabricator.wikimedia.org/T351283) [16:28:53] (03PS13) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [16:28:57] (03CR) 10Jbond: puppet: update gat_ca_server to also support srv discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:29:45] (03CR) 10Marostegui: [C: 03+2] db1210: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/975855 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui) [16:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1630) [16:30:23] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:30:31] (03PS14) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [16:33:05] (03CR) 10C. Scott Ananian: "Queued up for the late backport window today." [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: 10C. Scott Ananian) [16:36:09] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:38:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:38:38] (03CR) 10Jbond: [C: 03+2] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:39:30] (03CR) 10Ladsgroup: "pcc is failing, not sure an issue with pcc or this patch has issues." 
[puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:39:45] jouncebot: nowandnext [16:39:45] For the next 0 hour(s) and 20 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1630) [16:39:45] In 1 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800) [16:39:45] In 1 hour(s) and 20 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800) [16:40:02] no updates seems to be happening [16:40:56] (03CR) 10CI reject: [V: 04-1] [parsoid] Fix Parsoid relative links [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: 10C. Scott Ananian) [16:41:45] (03PS1) 10Vgutierrez: acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) [16:42:12] (03PS2) 10Ladsgroup: Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975806 (https://phabricator.wikimedia.org/T351237) [16:42:30] (03CR) 10Ladsgroup: [C: 03+2] Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975806 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [16:42:55] (03PS2) 10Vgutierrez: acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) [16:43:14] (03Merged) 10jenkins-bot: Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975806 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [16:43:21] (03PS10) 10Jbond: sanitarium_multiinstance: over private_wiki and 
private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [16:43:23] (03PS10) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [16:43:25] (03PS5) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [16:43:27] (03PS10) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [16:43:29] (03PS5) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [16:43:31] (03PS5) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [16:43:33] (03PS10) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [16:43:35] (03PS10) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [16:44:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/593/con" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:44:48] (03Merged) 10jenkins-bot: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:45:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) (owner: 10Vgutierrez) [16:45:41] (03CR) 10Ladsgroup: [C: 04-1] "PCC says it's missing mediamoderation_scan" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:46:15] (03CR) 10Ssingh: [V: 03+1] P:dns::auth::update: add support for setting ferm rules via confd (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:46:30] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:46:37] (03PS1) 10Jbond: centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) [16:46:39] (03PS3) 10Vgutierrez: acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) [16:46:41] (03PS2) 10Ssingh: P:dns::auth::update: add support for setting ferm rules via confd [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) [16:46:50] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:975806|Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki (T351237)]] [16:46:56] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [16:47:21] (03PS2) 10Jbond: centrallog: update tls_netstream_driver to use ossl [puppet] - 10https://gerrit.wikimedia.org/r/975861 (https://phabricator.wikimedia.org/T324623) [16:47:33] (03CR) 10Elukey: [C: 03+1] Update kserve images to v0.11.2 (new upstream version) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/975848 (owner: 10Klausman) [16:48:08] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:975806|Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:48:16] (03PS11) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [16:48:18] (03PS11) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [16:48:20] (03PS6) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [16:48:22] (03PS11) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [16:48:24] (03PS6) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [16:48:26] (03PS6) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [16:48:28] (03PS11) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [16:48:31] (03PS11) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [16:48:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975861 
(https://phabricator.wikimedia.org/T324623) (owner: 10Jbond) [16:49:21] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/597/con" [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) (owner: 10Vgutierrez) [16:51:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:52:59] (03PS12) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [16:53:01] (03PS12) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [16:53:03] (03PS7) 10Jbond: mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) [16:53:05] (03PS12) 10Jbond: airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [16:53:07] (03PS7) 10Jbond: phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) [16:53:09] (03PS7) 10Jbond: vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) [16:53:11] (03PS12) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [16:53:13] (03PS12) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [16:54:18] (03PS4) 10Vgutierrez: acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 
(https://phabricator.wikimedia.org/T351655) [16:54:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/599/console" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:56:05] (03CR) 10CI reject: [V: 04-1] mail::default_mail_relay: update to have a smarthosts parameter [puppet] - 10https://gerrit.wikimedia.org/r/975820 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:56:49] (03CR) 10Ssingh: [C: 03+1] acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) (owner: 10Vgutierrez) [16:56:57] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:975806|Set pagelinks migration to read new in testwiki, fawikiquote, cebwiki (T351237)]] (duration: 10m 06s) [16:57:02] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [16:58:01] (03CR) 10CI reject: [V: 04-1] airflow: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:58:44] (03CR) 10CI reject: [V: 04-1] phabricator: convert to pull from profile::mail::default_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/975821 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:59:01] (03CR) 10CI reject: [V: 04-1] vtrs: update vrts to use configure smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/975822 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [17:02:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 7h 20m 19s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [17:03:21] (03PS3) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) [17:03:28] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10MatthewVernon) [it was suggested I added jbond to this task] [17:05:10] (03CR) 10Klausman: [V: 03+2 C: 03+2] Update kserve images to v0.11.2 (new upstream version) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/975848 (owner: 10Klausman) [17:05:41] (03PS4) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) [17:06:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:38] (03PS1) 10Ottomata: changeprop - remove prometheus metrics config for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) [17:06:57] PROBLEM - Host kubernetes2041 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:04] (03CR) 10Muehlenhoff: "This was approved in today's SRE IF meeting." 
[puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [17:08:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:08:59] (03CR) 10Volans: "quick naming question inline" [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [17:09:28] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/975863 [17:10:31] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/975863 (owner: 10Volans) [17:10:33] (03CR) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [17:10:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [17:12:34] (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:14:10] (03CR) 10Volans: [C: 03+1] "patch LGTM, naming bikeshedding apart :)" [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [17:14:51] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Mask acme-chief.service on passive nodes [puppet] - 10https://gerrit.wikimedia.org/r/975860 (https://phabricator.wikimedia.org/T351655) (owner: 10Vgutierrez) [17:15:20] (03PS2) 10Ottomata: changeprop - remove prometheus metrics config for beta 
values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) [17:15:59] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: durum [17:16:26] (03CR) 10Vgutierrez: [C: 03+1] "LGTM! don't forget to run puppet on acmechief1001 before reimaging acmechief2001" [puppet] - 10https://gerrit.wikimedia.org/r/975853 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:17:33] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/975863 (owner: 10Volans) [17:17:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 27m 43s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [17:17:45] (03PS1) 10Muehlenhoff: Switch durum to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975864 (https://phabricator.wikimedia.org/T349619) [17:18:11] jouncebot: now [17:18:11] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [17:18:25] !log Restarting Gerrit # T351658 [17:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:33] T351658: gerrit1003 root partition filing up - https://phabricator.wikimedia.org/T351658 [17:20:27] https://gerrit.wikimedia.org/r/ Gerrit seems down [17:20:34] T351658 [17:20:49] oh it is restarting sorry I just read above [17:21:36] (03PS1) 10Volans: Upstream release v8.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/975866 [17:21:54] (03CR) 10Muehlenhoff: [C: 03+2] Switch durum to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975864 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [17:22:04] (03PS3) 10Ottomata: changeprop - fixes for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 
(https://phabricator.wikimedia.org/T351247) [17:22:28] (03CR) 10Volans: [C: 03+2] Upstream release v8.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/975866 (owner: 10Volans) [17:25:37] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.wikimedia.org with reason: host reimage [17:28:08] (03PS3) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [17:28:10] (03PS1) 10Jforrester: wikifunctions: Bump evaluators to 2023-11-20-171133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975867 (https://phabricator.wikimedia.org/T349385) [17:28:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.wikimedia.org with reason: host reimage [17:29:17] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653 (10jbond) @MatthewVernon this is almost certainly something using the the puppet ca directly instead of using `/etc/ssl/certs/wmf-ca... 
[17:29:29] (03Merged) 10jenkins-bot: Upstream release v8.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/975866 (owner: 10Volans) [17:30:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: durum [17:32:48] !log uploaded spicerack_8.1.0 to apt.wikimedia.org bullseye-wikimedia [17:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:03] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T351663 (10phaultfinder) [17:35:07] jouncebot: nowandnext [17:35:08] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [17:35:08] In 0 hour(s) and 24 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800) [17:35:08] In 0 hour(s) and 24 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800) [17:35:14] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump evaluators to 2023-11-20-171133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975867 (https://phabricator.wikimedia.org/T349385) (owner: 10Jforrester) [17:35:48] cp2035 powersupplyfailure [17:36:08] (03Merged) 10jenkins-bot: wikifunctions: Bump evaluators to 2023-11-20-171133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975867 (https://phabricator.wikimedia.org/T349385) (owner: 10Jforrester) [17:37:20] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:39:23] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:39:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:42:37] (03PS1) 10Jbond: Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/975869 
(https://phabricator.wikimedia.org/T351653) [17:43:54] (03PS2) 10Jbond: Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/975869 (https://phabricator.wikimedia.org/T351653) [17:47:53] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:48:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm) a:03Jhancock.wm [17:49:37] (03PS1) 10Hashar: gerrit: prevent access from a misbehaving IP [puppet] - 10https://gerrit.wikimedia.org/r/975870 (https://phabricator.wikimedia.org/T351658) [17:49:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Jhancock.wm) a:03Jhancock.wm [17:49:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.wikimedia.org with OS bullseye [17:51:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) a:03Jhancock.wm [17:52:08] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Jhancock.wm) a:03Jhancock.wm [17:52:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) [17:53:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10Jhancock.wm) a:03Jhancock.wm [17:54:37] (03PS1) 10Ebernhardson: cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 [17:54:47] (03CR) 10CI reject: [V: 04-1] cirrus updater: Supply config as yaml file [deployment-charts] - 
10https://gerrit.wikimedia.org/r/975872 (owner: 10Ebernhardson) [17:55:38] (03PS2) 10Ebernhardson: cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 [17:55:40] (03PS3) 10Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - 10https://gerrit.wikimedia.org/r/975321 [17:55:46] (03CR) 10CI reject: [V: 04-1] cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 (owner: 10Ebernhardson) [17:55:49] (03CR) 10CI reject: [V: 04-1] cirrus updater: Remove consumer start time override [deployment-charts] - 10https://gerrit.wikimedia.org/r/975321 (owner: 10Ebernhardson) [17:56:05] (03PS3) 10Ebernhardson: cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 [17:56:16] (03CR) 10Ssingh: [C: 03+1] gerrit: prevent access from a misbehaving IP [puppet] - 10https://gerrit.wikimedia.org/r/975870 (https://phabricator.wikimedia.org/T351658) (owner: 10Hashar) [17:56:19] (03CR) 10Ssingh: [V: 03+2 C: 03+2] gerrit: prevent access from a misbehaving IP [puppet] - 10https://gerrit.wikimedia.org/r/975870 (https://phabricator.wikimedia.org/T351658) (owner: 10Hashar) [17:56:45] (03PS4) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [17:56:47] (03PS1) 10Jforrester: wikifunctions: Reduce drain time from 600s default to 60s [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 [17:58:00] (03CR) 10Ottomata: "I am already in the process of deploying this to beta following https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800) [18:00:04] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T1800). [18:01:57] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 (owner: 10Ebernhardson) [18:02:11] (03PS4) 10Ottomata: changeprop - fixes for beta values [deployment-charts] - 10https://gerrit.wikimedia.org/r/975862 (https://phabricator.wikimedia.org/T351247) [18:02:53] (03Merged) 10jenkins-bot: cirrus updater: Supply config as yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/975872 (owner: 10Ebernhardson) [18:06:08] (03CR) 10Muehlenhoff: [C: 03+2] Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [18:06:35] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:06:45] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:08:57] !log start test backfill of 4 days of itwiki and frwiki edits to relforge from cirrus updater [18:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:24] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) >>! In T349402#9340355, @MoritzMuehlenhoff wrote: > crm2001.codfw.wmnet has been created and configured... 
[18:10:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10MoritzMuehlenhoff) @Jhancock.wm Please note that these need to have virtualization enabled during provisioning, they will be used as virtualisation servers. [18:10:33] (03CR) 10Cathal Mooney: [C: 03+2] Remove cloud hosts except clouddb from the "no IPv6 hostname" list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [18:11:07] (03Merged) 10jenkins-bot: Remove cloud hosts except clouddb from the "no IPv6 hostname" list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/975826 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [18:13:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wikidough [18:16:31] (03PS1) 10Muehlenhoff: Switch wikidough to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975877 (https://phabricator.wikimedia.org/T349619) [18:17:09] (03CR) 10Ssingh: [C: 03+1] Switch wikidough to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975877 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [18:17:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch wikidough to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975877 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [18:18:42] (03CR) 10BCornwall: [C: 03+2] acme-chief: Remove acmechief2001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975853 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:18:44] !log installed spicerack v8.1.0 on the cumin hosts [18:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:47] vgutierrez: ^^^ [18:18:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [18:25:38] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wikidough [18:27:27] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host acmechief2001.codfw.wmnet with OS bookworm [18:33:57] (03PS1) 10Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) [18:34:33] (03PS2) 10Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) [18:35:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [18:37:33] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:37:40] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:38:34] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [18:39:00] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [18:41:45] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief2001.codfw.wmnet with reason: host reimage [18:44:22] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2001.codfw.wmnet with reason: host reimage [18:44:35] (03CR) 10Subramanya Sastry: [C: 03+1] "recheck" [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: 10C. 
Scott Ananian) [18:48:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4045.ulsfo.wmnet [18:52:12] (03PS1) 10Muehlenhoff: Switch cp4045 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975881 (https://phabricator.wikimedia.org/T349619) [18:52:43] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4045 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975881 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [18:57:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4045.ulsfo.wmnet [18:59:26] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [18:59:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [18:59:38] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [18:59:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [19:01:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/975869 (https://phabricator.wikimedia.org/T351653) (owner: 10Jbond) [19:01:47] (03PS1) 10Jforrester: wikifunctions: Roll back Python evaluator to working version [deployment-charts] - 10https://gerrit.wikimedia.org/r/975882 [19:01:51] (03CR) 10Krinkle: [C: 03+1] mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [19:02:23] !log depool cp4045 for reboot [19:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:03:34] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief2001.codfw.wmnet with OS bookworm [19:04:22] !log 
cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:05:31] PROBLEM - Check systemd state on kubernetes1028 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4045.ulsfo.wmnet [19:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:16:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4045.ulsfo.wmnet [19:21:41] !log pool cp4045.ulsfo.wmnet post reboot and puppet 7 upgrade [19:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:10] (03PS43) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [19:28:12] (03PS1) 10AOkoth: vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 [19:28:38] (03PS2) 10AOkoth: vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 (https://phabricator.wikimedia.org/T349349) [19:29:06] (03PS3) 10AOkoth: vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 (https://phabricator.wikimedia.org/T349349) [19:29:33] (03CR) 10CI reject: [V: 04-1] vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 (https://phabricator.wikimedia.org/T349349) (owner: 10AOkoth) [19:30:45] (03PS4) 10AOkoth: vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 
(https://phabricator.wikimedia.org/T349349) [19:33:22] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1013.eqiad.wmnet with OS bullseye [19:36:59] (03CR) 10Dzahn: [C: 03+1] "yea, confirmed they require it now and it's in Debian" [puppet] - 10https://gerrit.wikimedia.org/r/975906 (https://phabricator.wikimedia.org/T349349) (owner: 10AOkoth) [19:44:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:46:32] (03PS1) 10BCornwall: Revert "acme-chief: Remove acmechief2001 passive host" [puppet] - 10https://gerrit.wikimedia.org/r/975887 [19:48:06] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage [19:48:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53642 and previous config saved to /var/cache/conftool/dbconfig/20231120-194818-arnaudb.json [19:48:23] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:50:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: host reimage [19:55:38] (03CR) 10BCornwall: [C: 03+2] Revert "acme-chief: Remove acmechief2001 passive host" [puppet] - 10https://gerrit.wikimedia.org/r/975887 (owner: 10BCornwall) [19:59:16] (03CR) 10Jsn.sherman: "thanks for your work on this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [19:59:29] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief2001.codfw.wmnet [19:59:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2001.codfw.wmnet [19:59:33] (03CR) 10Jsn.sherman: Disable PageTriage's extended features on beta testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) (owner: 10MPGuy2824) [20:02:05] RECOVERY - Check systemd state on kubernetes1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P53643 and previous config saved to /var/cache/conftool/dbconfig/20231120-200324-arnaudb.json [20:04:06] (03PS1) 10BCornwall: acme-chief: Remove acmechief2002 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975911 (https://phabricator.wikimedia.org/T342154) [20:07:21] (03CR) 10Ladsgroup: [C: 03+1] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:08:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1013.eqiad.wmnet with OS bullseye [20:09:32] (03PS3) 10Jdrewniak: Disable drawer temporarily while erroring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) (owner: 10Jdlrobson) [20:10:28] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1014.eqiad.wmnet with OS bullseye [20:12:51] (03PS4) 10Samtar: Enable action blocks in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973795 
(https://phabricator.wikimedia.org/T351048) (owner: 10Saint Johann) [20:13:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:14:34] (03PS2) 10JHathaway: puppetserver: remove log spam from user home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) [20:15:58] (03CR) 10JHathaway: puppetserver: remove log spam from user home dir sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway) [20:18:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P53644 and previous config saved to /var/cache/conftool/dbconfig/20231120-201831-arnaudb.json [20:18:49] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway) [20:21:17] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1014.eqiad.wmnet with OS bullseye [20:21:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1014.eqiad.wmnet with OS bullseye [20:22:13] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:22:37] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:24:59] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:27:35] 
PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:29:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (10Eevans) [20:30:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (10Eevans) >>! In T349875#9286181, @RobH wrote: >>>! In T348021#9281147, @Kappakayala wrote: >> @Clement_Goubert / @Joe could one of you help with the racking details? > > I'v... [20:31:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10Eevans) [20:32:53] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Eevans) [20:33:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53645 and previous config saved to /var/cache/conftool/dbconfig/20231120-203337-arnaudb.json [20:33:40] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:33:51] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:33:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:33:57] (03PS1) 10Ladsgroup: Undeploy DoubleWiki, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975912 (https://phabricator.wikimedia.org/T351675) [20:34:42] jouncebot: nowandnext [20:34:42] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [20:34:42] In 0 hour(s) and 25 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T2100) [20:35:27] (03CR) 10JHathaway: [C: 03+2] puppetserver: remove log spam from user home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway) [20:36:40] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Eevans) [20:37:33] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1014.eqiad.wmnet with reason: host reimage [20:40:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1014.eqiad.wmnet with reason: host reimage [20:42:41] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:45:49] (03PS1) 10Eevans: install_server: partman recipe for new sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) [20:46:10] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [20:46:52] (03CR) 10Eevans: "Hugh, can you sanity check this? 
I'm basing it off of what was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/770984" [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) (owner: 10Eevans) [20:47:21] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:47:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10Eevans) [20:48:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Eevans) [20:51:47] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:37] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:59:37] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T2100). [21:00:04] TheresNoTime, cscott, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:17] I can do the deployment today [21:01:43] ah cool, I was just about to say I can only be around for a bit :) [21:01:59] Well let's start with your patch then! 
[21:02:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1014.eqiad.wmnet with OS bullseye [21:02:12] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs1014.eqiad.wmnet [21:02:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1014.eqiad.wmnet [21:02:35] (03CR) 10Catrope: [C: 03+2] Enable action blocks in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973795 (https://phabricator.wikimedia.org/T351048) (owner: 10Saint Johann) [21:02:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973795 (https://phabricator.wikimedia.org/T351048) (owner: 10Saint Johann) [21:03:20] (03Merged) 10jenkins-bot: Enable action blocks in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973795 (https://phabricator.wikimedia.org/T351048) (owner: 10Saint Johann) [21:03:36] !log catrope@deploy2002 Started scap: Backport for [[gerrit:973795|Enable action blocks in ruwiki (T351048)]] [21:03:41] T351048: Enable action blocks in Russian Wikipedia - https://phabricator.wikimedia.org/T351048 [21:05:08] (03PS4) 10Catrope: Disable drawer temporarily while erroring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) (owner: 10Jdlrobson) [21:05:28] !log catrope@deploy2002 catrope and stjn: Backport for [[gerrit:973795|Enable action blocks in ruwiki (T351048)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:38] testing :) [21:05:42] jan_drewniak, Jdlrobson: Is either of you around for the deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/975879 and https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/975366/ ? 
[21:06:05] o/ I'm here [21:06:33] RoanKattouw: tested, lgtm :) [21:06:42] !log catrope@deploy2002 catrope and stjn: Continuing with sync [21:12:29] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:973795|Enable action blocks in ruwiki (T351048)]] (duration: 08m 52s) [21:12:36] T351048: Enable action blocks in Russian Wikipedia - https://phabricator.wikimedia.org/T351048 [21:12:49] (KubernetesCalicoDown) firing: kubernetes2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2041.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:13:38] RoanKattouw: confirmed in prod, thank you :) [21:13:48] Great! [21:13:55] Let's move on to jan_drewniak's patches [21:14:22] (PS5) Catrope: Disable MobileFrontend AMC drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) (owner: Jdlrobson) [21:14:57] (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) (owner: Jdlrobson) [21:15:50] (Merged) jenkins-bot: Disable MobileFrontend AMC drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975879 (https://phabricator.wikimedia.org/T351669) (owner: Jdlrobson) [21:16:05] !log catrope@deploy2002 Started scap: Backport for [[gerrit:975879|Disable MobileFrontend AMC drawer temporarily while erroring (T351669)]] [21:16:10] T351669: Disable the AMC drawer - https://phabricator.wikimedia.org/T351669 [21:17:24] !log catrope@deploy2002 catrope and jdlrobson: Backport for [[gerrit:975879|Disable MobileFrontend AMC drawer temporarily while erroring (T351669)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:47]
jan_drewniak: Please test on the test servers [21:31:07] (sorry for the delay, I'm WFH and something was going wrong in the house) [21:32:28] * jan_drewniak RoanKattouw: well I hope your house is ok! but yes, looks good to sync [21:32:34] !log catrope@deploy2002 catrope and jdlrobson: Continuing with sync [21:33:03] Yes everything is OK now :) [21:34:16] sorry i'm late!  lost track of time, but i had a patch on the backport queue if it's not too late. [21:38:16] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:975879|Disable MobileFrontend AMC drawer temporarily while erroring (T351669)]] (duration: 22m 11s) [21:38:21] T351669: Disable the AMC drawer - https://phabricator.wikimedia.org/T351669 [21:38:24] No worries cscott, I had to step out for 15 mins in the middle of a deployment so we're behind schedule for that reason [21:38:30] I have another patch to deploy for jan_drewniak and then you're next [21:38:52] (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975366 (https://phabricator.wikimedia.org/T349622) (owner: Jdlrobson) [21:40:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:41:53] PROBLEM - Check systemd state on mw2442 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:03] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:07] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (RobH) a:Clement_Goubert→None [21:51:11] SRE, ops-eqiad, DC-Ops,
serviceops, Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (RobH) [21:57:46] (Merged) jenkins-bot: Revert "mw.notify: Limit width of overlay to max-width-page-container" [skins/Vector] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975366 (https://phabricator.wikimedia.org/T349622) (owner: Jdlrobson) [21:58:00] !log catrope@deploy2002 Started scap: Backport for [[gerrit:975366|Revert "mw.notify: Limit width of overlay to max-width-page-container" (T349622)]] [21:58:04] T349622: Notifications (mw.notify) are a long way from the content on wide screens - https://phabricator.wikimedia.org/T349622 [21:59:22] !log catrope@deploy2002 jdlrobson and catrope: Backport for [[gerrit:975366|Revert "mw.notify: Limit width of overlay to max-width-page-container" (T349622)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:00:05] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231120T2200). [22:07:45] jan_drewniak: Please test on the test servers [22:07:51] (sorry I missed the bot's announcement 8 minutes ago) [22:09:23] RoanKattouw: good to sync [22:09:29] !log catrope@deploy2002 jdlrobson and catrope: Continuing with sync [22:09:40] (CR) Catrope: [C: +2] [parsoid] Fix Parsoid relative links [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: C.
Scott Ananian) [22:09:54] Meanwhile I'm getting Jenkins started early on the Parsoid patch because it takes forever [22:15:41] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:975366|Revert "mw.notify: Limit width of overlay to max-width-page-container" (T349622)]] (duration: 17m 40s) [22:15:52] T349622: Notifications (mw.notify) are a long way from the content on wide screens - https://phabricator.wikimedia.org/T349622 [22:15:58] (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: C. Scott Ananian) [22:16:12] jan_drewniak: Your deploys are done [22:16:23] * jan_drewniak RoanKattouw: thank you! [22:16:39] Ugh and we lost cscott [22:17:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:18:12] Aborting his deploy since he's not here [22:19:07] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entries for cloud hosts. - cmooney@cumin1001" [22:19:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entries for cloud hosts. - cmooney@cumin1001" [22:19:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:20:06] (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: C.
Scott Ananian) [22:20:14] sorry i'm back [22:20:28] SRE, Growth-Team, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team, and 3 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out state - https://phabricator.wikimedia.org/T335125 (matmarex) Open→Resolved Look... [22:27:08] (Merged) jenkins-bot: [parsoid] Fix Parsoid relative links [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975596 (https://phabricator.wikimedia.org/T350952) (owner: C. Scott Ananian) [22:27:14] w00t! [22:27:22] !log catrope@deploy2002 Started scap: Backport for [[gerrit:975596|[parsoid] Fix Parsoid relative links (T350952)]] [22:27:27] T350952: After fixing subpage links in Parsoid read views, TOC links are broken - https://phabricator.wikimedia.org/T350952 [22:28:40] !log catrope@deploy2002 catrope and cscott: Backport for [[gerrit:975596|[parsoid] Fix Parsoid relative links (T350952)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:29:11] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:35:15] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (RobH) [22:36:02] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (RobH) a:Clement_Goubert→None [22:36:23] RoanKattouw: let me know when the patch is live on a canary [22:36:43] Ugh it already has been for 8 minutes, sorry for the delay [22:36:53] The shell command makes noises but not at the time that I need it the most [22:41:08] RoanKattouw: ok, I verified that it seems to be working on the canary [22:41:15] Great, let's roll it out [22:41:18] !log catrope@deploy2002 catrope and cscott:
Continuing with sync [22:41:36] thanks! [22:46:54] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:975596|[parsoid] Fix Parsoid relative links (T350952)]] (duration: 19m 32s) [22:46:59] T350952: After fixing subpage links in Parsoid read views, TOC links are broken - https://phabricator.wikimedia.org/T350952 [22:47:09] All done! [22:47:58] looks good even without x-wikimedia-debug turned on now, thanks! [23:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:33:22] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown