[01:03:41] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:25] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:23] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:48:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:10:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:44] (03PS2) 10Zabe: eventlogging: drop absented check_eventlogging_jobs file [puppet] - 10https://gerrit.wikimedia.org/r/879854 [02:14:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:16:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:20:46] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:06] (03PS2) 10Zabe: wikitech_private: convert to new array syntax [puppet] - 10https://gerrit.wikimedia.org/r/779860 [03:14:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:16:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:21:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:23:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:52:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:56:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:36:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:38:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:49:13] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:10] (03PS3) 10Marostegui: raid_handler: Use universal_newlines [puppet] - 10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [06:16:23] (03CR) 10Marostegui: [C: 03+1] raid_handler: Use universal_newlines [puppet] - 10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [06:16:58] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) Thanks - let me know when the main board arrives so we can shut down the host for you. 
[06:18:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:18:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:20:54] (03PS1) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [06:21:25] (03PS2) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [06:21:52] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover time" [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [06:22:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:22:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [06:24:02] (03PS1) 10Marostegui: mariadb: Test db1164 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/888395 (https://phabricator.wikimedia.org/T329259) [06:24:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:25:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:25:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:25:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:25:27] (03CR) 10Marostegui: [C: 03+2] mariadb: Test db1164 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/888395 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) 
[06:28:32] (03CR) 10Kongpcmail: "kongpcmail@gmail.com" [puppet] - 10https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: 10Majavah) [06:30:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:30:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:31:06] o.0 kongpcmail [06:31:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [06:31:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [06:31:50] (03PS1) 10Marostegui: Revert "mariadb: Test db1164 in m1" [puppet] - 10https://gerrit.wikimedia.org/r/888335 [06:32:46] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Test db1164 in m1" [puppet] - 10https://gerrit.wikimedia.org/r/888335 (owner: 10Marostegui) [06:33:58] (03PS3) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [06:34:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [06:34:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [06:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44336 and previous config saved to /var/cache/conftool/dbconfig/20230213-063449-marostegui.json [06:34:53] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:36:25] (03PS1) 10Marostegui: Bug: T329181 [puppet] - 10https://gerrit.wikimedia.org/r/888516 
(https://phabricator.wikimedia.org/T329181) [06:36:45] (03CR) 10CI reject: [V: 04-1] Bug: T329181 [puppet] - 10https://gerrit.wikimedia.org/r/888516 (https://phabricator.wikimedia.org/T329181) (owner: 10Marostegui) [06:36:57] (03PS2) 10Marostegui: instances.yaml: Remove db1099 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/888516 (https://phabricator.wikimedia.org/T329181) [06:37:18] (03CR) 10CI reject: [V: 04-1] instances.yaml: Remove db1099 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/888516 (https://phabricator.wikimedia.org/T329181) (owner: 10Marostegui) [06:39:50] (03PS3) 10Marostegui: instances.yaml: Remove db1099 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/888516 (https://phabricator.wikimedia.org/T329181) [06:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44337 and previous config saved to /var/cache/conftool/dbconfig/20230213-063955-marostegui.json [06:39:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:40:25] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1099 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/888516 (https://phabricator.wikimedia.org/T329181) (owner: 10Marostegui) [06:40:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1099 from dbctl T329181', diff saved to https://phabricator.wikimedia.org/P44338 and previous config saved to /var/cache/conftool/dbconfig/20230213-064051-marostegui.json [06:40:55] T329181: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 [06:46:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:47:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:55:02] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44339 and previous config saved to /var/cache/conftool/dbconfig/20230213-065501-marostegui.json [06:59:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259 [06:59:28] T329259: Switchover m1 master (db1176 -> db1164) - https://phabricator.wikimedia.org/T329259 [06:59:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259 [07:06:06] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888519 (https://phabricator.wikimedia.org/T329011) [07:06:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [07:07:03] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888519 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [07:07:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [07:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T328817)', diff saved to https://phabricator.wikimedia.org/P44340 and previous config saved to /var/cache/conftool/dbconfig/20230213-070717-marostegui.json [07:07:22] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:07:33] (03Merged) 10jenkins-bot: control-mariadb-10.4-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888519 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [07:10:08] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44341 and previous config saved to /var/cache/conftool/dbconfig/20230213-071007-marostegui.json [07:11:23] (03CR) 10Slyngshede: [C: 03+2] sre:ganeti:reimage switch tty [cookbooks] - 10https://gerrit.wikimedia.org/r/888241 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [07:12:18] slyngs: o/ thanks! Going to re-test my cookbook in a bit :) [07:12:25] (good morning) [07:12:36] Good morning to you too :-) [07:25:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T329203)', diff saved to https://phabricator.wikimedia.org/P44342 and previous config saved to /var/cache/conftool/dbconfig/20230213-072514-marostegui.json [07:25:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:25:18] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:25:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T329203)', diff saved to https://phabricator.wikimedia.org/P44343 and previous config saved to /var/cache/conftool/dbconfig/20230213-072535-marostegui.json [07:27:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T328817)', diff saved to https://phabricator.wikimedia.org/P44344 and previous config saved to /var/cache/conftool/dbconfig/20230213-072752-marostegui.json [07:27:57] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T329203)', diff saved to https://phabricator.wikimedia.org/P44345 and 
previous config saved to /var/cache/conftool/dbconfig/20230213-072825-marostegui.json [07:37:46] !log Deploy schema change on db2151 T329260 [07:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:50] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:39:32] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:41:18] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [07:41:29] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [07:42:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P44346 and previous config saved to /var/cache/conftool/dbconfig/20230213-074258-marostegui.json [07:43:00] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44347 and previous config saved to /var/cache/conftool/dbconfig/20230213-074331-marostegui.json [07:46:50] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:47:12] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:53:48] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:54:30] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [07:54:39] !log elukey@cumin1001 END (FAIL) - Cookbook 
sre.ganeti.reimage (exit_code=99) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [07:55:26] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:56:36] sorry for the spam folks, we are trying to solve a dhcp issue with the ganeti vm reimage [07:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P44348 and previous config saved to /var/cache/conftool/dbconfig/20230213-075805-marostegui.json [07:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44349 and previous config saved to /var/cache/conftool/dbconfig/20230213-075838-marostegui.json [07:59:25] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:59:54] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [08:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T0800). [08:00:04] Jhs and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:22] I'm here :) [08:00:45] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:47] (03CR) 10Muehlenhoff: [C: 03+2] raid_handler: Use universal_newlines [puppet] - 10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [08:06:42] I can deploy today [08:08:47] (03PS3) 10Majavah: [bjnwiki] Change time zone setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888348 (https://phabricator.wikimedia.org/T328887) (owner: 10Superpes15) [08:08:52] Hi taavi :) [08:08:54] (03PS3) 10Majavah: [fawiki] Add an alias to Help namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888350 (https://phabricator.wikimedia.org/T329465) (owner: 10Superpes15) [08:09:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888331 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [08:09:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888348 (https://phabricator.wikimedia.org/T328887) (owner: 10Superpes15) [08:09:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888350 (https://phabricator.wikimedia.org/T329465) (owner: 10Superpes15) [08:10:04] (03Merged) 10jenkins-bot: Add a temporary logo to trwikiquote (Vector legacy + Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888331 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [08:10:06] (03Merged) 10jenkins-bot: [bjnwiki] Change time zone setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888348 (https://phabricator.wikimedia.org/T328887) (owner: 10Superpes15) [08:10:09] (03Merged) 10jenkins-bot: [fawiki] Add an alias 
to Help namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888350 (https://phabricator.wikimedia.org/T329465) (owner: 10Superpes15) [08:10:28] !log taavi@deploy1002 Started scap: Backport for [[gerrit:888331|Add a temporary logo to trwikiquote (Vector legacy + Vector 2022) (T329399)]], [[gerrit:888348|[bjnwiki] Change time zone setting (T328887)]], [[gerrit:888350|[fawiki] Add an alias to Help namespace (T329465)]] [08:10:34] T328887: Change time zone setting in Banjar Wikipedia - https://phabricator.wikimedia.org/T328887 [08:10:34] T329465: A new alias for "Help:" namespace and a new pseudo-namespace for "Manual of Styles" on Persian Wikipedia - https://phabricator.wikimedia.org/T329465 [08:10:35] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [08:11:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [08:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T328817)', diff saved to https://phabricator.wikimedia.org/P44350 and previous config saved to /var/cache/conftool/dbconfig/20230213-081311-marostegui.json [08:13:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [08:13:15] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:13:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [08:13:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T328817)', diff saved to https://phabricator.wikimedia.org/P44351 and previous config saved to /var/cache/conftool/dbconfig/20230213-081332-marostegui.json [08:13:37] hm, the docker build is taking longer than usual [08:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2123 (T329203)', diff saved to https://phabricator.wikimedia.org/P44352 and previous config saved to /var/cache/conftool/dbconfig/20230213-081344-marostegui.json [08:13:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [08:13:48] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:14:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [08:14:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [08:14:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [08:14:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [08:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T329203)', diff saved to https://phabricator.wikimedia.org/P44353 and previous config saved to /var/cache/conftool/dbconfig/20230213-081431-marostegui.json [08:14:51] (03CR) 10Jelto: [V: 03+1] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [08:14:57] (03PS2) 10Jelto: prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) [08:17:02] !log installing curl security updates [08:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44354 and previous config saved to /var/cache/conftool/dbconfig/20230213-081722-marostegui.json [08:20:29] !log taavi@deploy1002 taavi and superpes: Backport for [[gerrit:888331|Add a temporary logo to trwikiquote (Vector legacy + Vector 2022) (T329399)]], [[gerrit:888348|[bjnwiki] Change time zone setting (T328887)]], [[gerrit:888350|[fawiki] Add an alias to Help namespace (T329465)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:20:34] finally [08:20:35] T328887: Change time zone setting in Banjar Wikipedia - https://phabricator.wikimedia.org/T328887 [08:20:35] T329465: A new alias for "Help:" namespace and a new pseudo-namespace for "Manual of Styles" on Persian Wikipedia - https://phabricator.wikimedia.org/T329465 [08:20:36] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [08:20:38] Superpes: please test all three patches [08:21:04] (03PS1) 10Marostegui: mariadb: Convert db1126 to s8 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/888629 (https://phabricator.wikimedia.org/T329482) [08:21:06] (03CR) 10Filippo Giunchedi: prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [08:21:07] Thanks taavi, trwikiquote is good! [08:22:36] what about the others?
[08:22:48] Yep all good [08:22:59] (03CR) 10Marostegui: [C: 03+2] mariadb: Convert db1126 to s8 candidate master [puppet] - 10https://gerrit.wikimedia.org/r/888629 (https://phabricator.wikimedia.org/T329482) (owner: 10Marostegui) [08:23:00] ok, syncing [08:23:13] Thanks [08:25:47] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Jclark-ctr can you pull the disk out again for another test? Thanks [08:26:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [08:27:02] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [08:29:52] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888331|Add a temporary logo to trwikiquote (Vector legacy + Vector 2022) (T329399)]], [[gerrit:888348|[bjnwiki] Change time zone setting (T328887)]], [[gerrit:888350|[fawiki] Add an alias to Help namespace (T329465)]] (duration: 19m 24s) [08:29:58] T328887: Change time zone setting in Banjar Wikipedia - https://phabricator.wikimedia.org/T328887 [08:29:58] T329465: A new alias for "Help:" namespace and a new pseudo-namespace for "Manual of Styles" on Persian Wikipedia - https://phabricator.wikimedia.org/T329465 [08:29:58] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [08:30:02] woot [08:30:07] and no namespaceDupes on fawiki either [08:31:32] (03CR) 10Jelto: [C: 03+2] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [08:31:52] jhs is still not here :/ [08:32:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44355 and previous 
config saved to /var/cache/conftool/dbconfig/20230213-083229-marostegui.json [08:33:12] Uhm yep it's not here... Well, in the meantime, thanks for your support and your time taavi :D [08:33:26] hth, as always [08:35:21] PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - free space: /srv 15957 MB (5% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [08:36:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [08:36:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T328817)', diff saved to https://phabricator.wikimedia.org/P44356 and previous config saved to /var/cache/conftool/dbconfig/20230213-083648-marostegui.json [08:36:52] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:37:32] (03PS3) 10Marostegui: dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) [08:39:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [08:41:30] (03PS4) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 [08:43:40] Hi folks, I completely missed the start of the deployment window. Amir1, urbanecm: is it still happening, or should I reschedule?
[08:44:59] (03PS2) 10Jon Harald Søby: Rename project namespace in guwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) [08:45:58] (03PS1) 10Vgutierrez: cache::haproxy: Update to 2.6.8 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/888632 (https://phabricator.wikimedia.org/T321775) [08:47:09] (03CR) 10Muehlenhoff: [C: 03+2] cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 (owner: 10Muehlenhoff) [08:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44357 and previous config saved to /var/cache/conftool/dbconfig/20230213-084735-marostegui.json [08:51:30] !log rolling-restart of codfw swift frontends [08:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:42] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [08:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P44358 and previous config saved to /var/cache/conftool/dbconfig/20230213-085154-marostegui.json [08:52:38] (03CR) 10Nicolas Fraison: [C: 03+2] feat(presto): add gc logs [puppet] - 10https://gerrit.wikimedia.org/r/888214 (https://phabricator.wikimedia.org/T329054) (owner: 10Nicolas Fraison) [08:53:22] (03PS1) 10Ayounsi: Peering News: add splay and change run time [puppet] - 10https://gerrit.wikimedia.org/r/888633 [08:53:29] (03PS2) 10Slyngshede: linux-host-entries.ttyS0-115200 remove reimage test server. [puppet] - 10https://gerrit.wikimedia.org/r/888229 (https://phabricator.wikimedia.org/T324744) [08:53:36] (03CR) 10Slyngshede: linux-host-entries.ttyS0-115200 remove reimage test server. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888229 (https://phabricator.wikimedia.org/T324744) (owner: 10Slyngshede) [08:54:30] (03CR) 10Slyngshede: [C: 03+2] linux-host-entries.ttyS0-115200 remove reimage test server. [puppet] - 10https://gerrit.wikimedia.org/r/888229 (https://phabricator.wikimedia.org/T324744) (owner: 10Slyngshede) [08:54:55] (03CR) 10Ayounsi: [C: 03+2] Peering News: add splay and change run time [puppet] - 10https://gerrit.wikimedia.org/r/888633 (owner: 10Ayounsi) [08:54:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:02:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T329203)', diff saved to https://phabricator.wikimedia.org/P44360 and previous config saved to /var/cache/conftool/dbconfig/20230213-090241-marostegui.json [09:02:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [09:02:46] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:02:52] (03CR) 10David Caro: P:wmcs::services::toolsdb_replica_cnf: don't manage the directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888311 (https://phabricator.wikimedia.org/T329377) (owner: 10Majavah) [09:02:54] (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch: reverse-proxy access to opensearch API [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:02:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [09:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44361 and previous config saved to /var/cache/conftool/dbconfig/20230213-090302-marostegui.json [09:03:24] !log rolling restart of Apache on mw/codfw servers to pick up updated libxml [09:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:46] (03PS1) 10Filippo Giunchedi: ssl: add public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888634 (https://phabricator.wikimedia.org/T320702) [09:06:48] (03CR) 10Majavah: P:wmcs::services::toolsdb_replica_cnf: don't manage the directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888311 (https://phabricator.wikimedia.org/T329377) (owner: 10Majavah) [09:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P44362 and previous config saved to /var/cache/conftool/dbconfig/20230213-090701-marostegui.json [09:07:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44363 and previous config saved to /var/cache/conftool/dbconfig/20230213-090712-marostegui.json [09:08:36] (03CR) 10CI reject: [V: 04-1] ssl: add public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888634 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:11:01] (03PS6) 10Slyngshede: P:idm split IDM into staging and prod. 
[puppet] - 10https://gerrit.wikimedia.org/r/888169 [09:13:19] (03PS1) 10Marostegui: control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888635 (https://phabricator.wikimedia.org/T329011) [09:13:31] (03CR) 10Muehlenhoff: [C: 03+2] Reapply puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/888230 (owner: 10Muehlenhoff) [09:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:35] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888635 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:22:05] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bullseye: Update version [software] - 10https://gerrit.wikimedia.org/r/888635 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:22:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T328817)', diff saved to https://phabricator.wikimedia.org/P44365 and previous config saved to /var/cache/conftool/dbconfig/20230213-092207-marostegui.json [09:22:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:22:11] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:22:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44366 and previous config saved to /var/cache/conftool/dbconfig/20230213-092218-marostegui.json [09:22:22] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [09:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44367 and previous config saved to /var/cache/conftool/dbconfig/20230213-092228-marostegui.json [09:22:40] (03CR) 10Filippo Giunchedi: "Overriding jenkins as the failing check is SPDX header, which isn't a thing for certs" [puppet] - 10https://gerrit.wikimedia.org/r/888634 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:22:44] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] ssl: add public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888634 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:24:02] (03PS1) 10Nicolas Fraison: fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 [09:24:04] (03PS1) 10Filippo Giunchedi: rake_modules: ignore crt files for SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/888637 [09:24:24] (03CR) 10CI reject: [V: 04-1] fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 (owner: 10Nicolas Fraison) [09:24:42] (03CR) 10Filippo Giunchedi: "Discovered on https://gerrit.wikimedia.org/r/c/operations/puppet/+/888634" [puppet] - 10https://gerrit.wikimedia.org/r/888637 (owner: 10Filippo Giunchedi) [09:27:56] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39511/console" [puppet] - 10https://gerrit.wikimedia.org/r/888636 (owner: 10Nicolas Fraison) [09:28:56] (03PS2) 10Nicolas Fraison: fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 [09:29:16] (03CR) 10CI reject: [V: 04-1] fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 (owner: 
10Nicolas Fraison) [09:31:25] (03PS3) 10Nicolas Fraison: fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 [09:33:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10ayounsi) Thanks for the task! It's a great tool to have and know we can use. I'd err on the side of holding it until we actually need it (and we prob... [09:34:41] (03PS4) 10Nicolas Fraison: fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 (https://phabricator.wikimedia.org/T329054) [09:36:03] (03CR) 10Ayounsi: [C: 03+2] Netbox: add support for central Redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [09:37:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44368 and previous config saved to /var/cache/conftool/dbconfig/20230213-093725-marostegui.json [09:39:01] (03PS2) 10Filippo Giunchedi: rake_modules: ignore crt files for SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/888637 [09:39:03] (03PS1) 10Filippo Giunchedi: ssl: update public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888639 (https://phabricator.wikimedia.org/T320702) [09:40:28] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39512/console" [puppet] - 10https://gerrit.wikimedia.org/r/888632 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [09:41:09] (03CR) 10Filippo Giunchedi: [C: 03+2] ssl: update public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888639 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:41:15] (03PS2) 10Filippo Giunchedi: ssl: update public cert for 
kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888639 (https://phabricator.wikimedia.org/T320702) [09:41:18] (03CR) 10Filippo Giunchedi: [V: 03+2] ssl: update public cert for kibana + logs-api [puppet] - 10https://gerrit.wikimedia.org/r/888639 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:41:49] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host test-reimage2001.codfw.wmnet with OS bullseye [09:41:57] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by slyngshede@cumin1001 for host test-re... [09:43:20] !log elukey@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [09:43:22] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [09:43:27] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Update to 2.6.8 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/888632 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [09:43:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/888637 (owner: 10Filippo Giunchedi) [09:44:40] !log rolling upgrade to HAProxy 2.6.8 in ulsfo - T321775 [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:44] T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 [09:45:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44369 and previous config saved to /var/cache/conftool/dbconfig/20230213-094546-marostegui.json [09:45:50] T328817: Drop cuc_user and 
cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:46:07] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) [09:46:33] (03PS4) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [09:47:38] (03CR) 10Klausman: [C: 03+1] sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:50:16] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/888636 (https://phabricator.wikimedia.org/T329054) (owner: 10Nicolas Fraison) [09:50:44] (03CR) 10Nicolas Fraison: [C: 03+2] fix(presto): ensure log folder has appropriate right [puppet] - 10https://gerrit.wikimedia.org/r/888636 (https://phabricator.wikimedia.org/T329054) (owner: 10Nicolas Fraison) [09:51:27] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on test-reimage2001.codfw.wmnet with reason: host reimage [09:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T329203)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20230213-095231-marostegui.json [09:52:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:52:39] (03PS5) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [09:52:40] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:52:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2157.codfw.wmnet with reason: Maintenance [09:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T329203)', diff saved to https://phabricator.wikimedia.org/P44371 and previous config saved to /var/cache/conftool/dbconfig/20230213-095257-marostegui.json [09:54:03] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on test-reimage2001.codfw.wmnet with reason: host reimage [09:54:03] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [09:54:17] (03PS2) 10Elukey: sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) [09:57:21] (03CR) 10Klausman: [C: 03+1] sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:57:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T329203)', diff saved to https://phabricator.wikimedia.org/P44372 and previous config saved to /var/cache/conftool/dbconfig/20230213-095757-marostegui.json [09:58:01] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P44373 and previous config saved to /var/cache/conftool/dbconfig/20230213-100053-marostegui.json [10:01:29] (03CR) 10Jcrespo: [C: 03+1] "heads up: the dbs have both binlog_format: mixed" [puppet] - 10https://gerrit.wikimedia.org/r/888359 
(https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [10:02:35] (03CR) 10Filippo Giunchedi: [C: 03+2] rake_modules: ignore crt files for SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/888637 (owner: 10Filippo Giunchedi) [10:05:31] (03PS6) 10Marostegui: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) [10:05:44] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host test-reimage2001.codfw.wmnet with OS bullseye [10:05:53] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by slyngshede@cumin1001 for host test-reimag... [10:06:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/888359 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [10:06:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46375 [10:06:22] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46375 [10:06:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259 [10:06:43] T329259: Switchover m1 master (db1176 -> db1164) - 
https://phabricator.wikimedia.org/T329259 [10:06:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259 [10:09:46] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:10:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44374 and previous config saved to /var/cache/conftool/dbconfig/20230213-101304-marostegui.json [10:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:18] (03PS1) 10Volans: sre.deploy.python-code: allow to override user [cookbooks] - 10https://gerrit.wikimedia.org/r/888644 [10:15:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:56] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:16:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P44375 and previous config saved to /var/cache/conftool/dbconfig/20230213-101559-marostegui.json [10:16:28] !log stopping bacula and disabling puppet at backup1001 for m1 switchover T329259 [10:16:28] (03CR) 10Ayounsi: [C: 03+1] 
sre.deploy.python-code: allow to override user [cookbooks] - 10https://gerrit.wikimedia.org/r/888644 (owner: 10Volans) [10:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:31] T329259: Switchover m1 master (db1176 -> db1164) - https://phabricator.wikimedia.org/T329259 [10:16:50] job scraping from prometheus will complain [10:19:48] (03PS7) 10Slyngshede: P:idm split IDM into staging and prod. [puppet] - 10https://gerrit.wikimedia.org/r/888169 [10:20:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:45] (03CR) 10Slyngshede: [C: 03+2] P:idm split IDM into staging and prod. [puppet] - 10https://gerrit.wikimedia.org/r/888169 (owner: 10Slyngshede) [10:21:52] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:24:13] (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: allow to override user [cookbooks] - 10https://gerrit.wikimedia.org/r/888644 (owner: 10Volans) [10:25:05] 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) [10:25:46] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:57] (03Merged) 10jenkins-bot: sre.deploy.python-code: allow to override user [cookbooks] - 10https://gerrit.wikimedia.org/r/888644 (owner: 10Volans) [10:26:37] (03PS1) 10Muehlenhoff: Add 
component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/888645 (https://phabricator.wikimedia.org/T329491) [10:27:30] (03PS3) 10Elukey: sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) [10:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44376 and previous config saved to /var/cache/conftool/dbconfig/20230213-102810-marostegui.json [10:29:02] Reduced availability for job bacula in ops@eqiad it's me, I don't think I can ack that individually [10:29:34] (03PS1) 10Filippo Giunchedi: opensearch: add aliases to dashboards vhost [puppet] - 10https://gerrit.wikimedia.org/r/888646 (https://phabricator.wikimedia.org/T320702) [10:29:54] (03CR) 10CI reject: [V: 04-1] opensearch: add aliases to dashboards vhost [puppet] - 10https://gerrit.wikimedia.org/r/888646 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:29:55] jynus: you can [10:30:20] i.e. the "silence this alert" action on alerts.w.o will DTRT [10:30:51] but wouldn't that silence all jobs? 
[10:31:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44377 and previous config saved to /var/cache/conftool/dbconfig/20230213-103105-marostegui.json [10:31:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:31:10] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:31:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T328817)', diff saved to https://phabricator.wikimedia.org/P44378 and previous config saved to /var/cache/conftool/dbconfig/20230213-103126-marostegui.json [10:31:56] (03PS2) 10Filippo Giunchedi: opensearch: add aliases to dashboards vhost [puppet] - 10https://gerrit.wikimedia.org/r/888646 (https://phabricator.wikimedia.org/T320702) [10:32:23] jynus: no, the popup will have all labels filled in, including job=bacula [10:32:29] ok, trying [10:32:50] you can double check with the "preview" feature [10:33:22] "Silence submitted" [10:33:42] sweet!
[10:34:36] (03CR) 10Muehlenhoff: [C: 03+2] Add component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/888645 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [10:34:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1001.eqiad.wmnet with OS bullseye [10:36:22] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39513/console" [puppet] - 10https://gerrit.wikimedia.org/r/888646 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:37:05] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: fix puppet disable action [cookbooks] - 10https://gerrit.wikimedia.org/r/888640 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:39:08] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "Mid-deployment so going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/888646 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:40:30] PROBLEM - Etcd cluster health on ml-staging-etcd2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:40:42] PROBLEM - etcd service on ml-staging-etcd2002 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T329203)', diff saved to https://phabricator.wikimedia.org/P44379 and previous config saved to /var/cache/conftool/dbconfig/20230213-104316-marostegui.json [10:43:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:43:21] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:43:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2171.codfw.wmnet with reason: Maintenance [10:43:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44380 and previous config saved to /var/cache/conftool/dbconfig/20230213-104337-marostegui.json [10:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:27] (03PS1) 10Filippo Giunchedi: hieradata: fix kibana7 vhost selection for pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/888648 (https://phabricator.wikimedia.org/T320702) [10:46:52] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:48:29] !log nfraison@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [10:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44381 and previous config saved to /var/cache/conftool/dbconfig/20230213-104850-marostegui.json [10:48:54] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:49:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:50:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:02] (03CR) 10Vgutierrez: [C: 03+1] "looks good, logstash.wm.o is listed on the SAN list used by the backend servers:" [puppet] - 10https://gerrit.wikimedia.org/r/888648 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) 
[10:51:42] vgutierrez: thank you sir [10:51:52] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix kibana7 vhost selection for pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/888648 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:53:01] (03PS3) 10Mazevedo: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) [10:53:07] (03CR) 10Cathal Mooney: [C: 04-1] "Downvoting for now, we'll not merge this until/if it is required." [homer/public] - 10https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) (owner: 10Cathal Mooney) [10:54:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T328817)', diff saved to https://phabricator.wikimedia.org/P44382 and previous config saved to /var/cache/conftool/dbconfig/20230213-105422-marostegui.json [10:54:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:54:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) @ayounsi yep I'm inclined to agree. I've -1'd the patch and we can merge if we need it. [10:56:20] !log roll-restart pybal in eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888648 - T320702 [10:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:24] T320702: Jaeger secure access to OpenSearch cluster - https://phabricator.wikimedia.org/T320702 [10:56:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I believe this was required for old Debian Stretch virtual machines, which were the last that had the .wmflabs domain." 
[puppet] - 10https://gerrit.wikimedia.org/r/852836 (owner: 10Majavah) [10:57:01] jouncebot: nowandnext [10:57:01] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [10:57:01] In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T1100) [10:58:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:49] waiting for the lvs1019 recovery shortly [11:00:00] !log Failover m1 from db1176 to db1164 - T329259 [11:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T1100) [11:00:04] T329259: Switchover m1 master (db1176 -> db1164) - https://phabricator.wikimedia.org/T329259 [11:00:24] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:00:37] something I can check for m1? [11:00:37] all done [11:00:44] jynus: I am checking the services now [11:01:02] I might need to restart etherpad [11:01:48] indeed "upstream connect error or disconnect/reset before headers. reset reason: connection failure" [11:01:59] Just restarted it [11:02:00] same [11:02:22] now it is saying loading [11:02:26] there's hope :D [11:02:29] works for me now after the restart [11:02:31] (03CR) 10Jbond: [C: 04-1] "sorry missed this on the last one" [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [11:02:39] working now for me too [11:02:41] it works now, probably has a cold start [11:02:42] yep etherpad is back up for me [11:02:57] marostegui: any connections left on the old db? [11:03:10] nop [11:03:26] let's merge the other patch, the one for monitoring [11:03:29] ok [11:03:31] or should I= [11:03:33] ?
[11:03:41] (03PS4) 10Marostegui: dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) [11:03:50] jynus: I am doing it [11:03:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P44383 and previous config saved to /var/cache/conftool/dbconfig/20230213-110356-marostegui.json [11:04:33] ok let me know when merged so I can run puppet on a host and do a test backup [11:04:47] yeah, waiting for CI [11:05:42] (03CR) 10Marostegui: [C: 03+2] dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [11:06:00] jynus: merged [11:06:23] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [11:07:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 7 hosts with reason: Cluster half broken, in the middle of upgrading [11:07:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: Cluster half broken, in the middle of upgrading [11:08:07] all good: dump.db_inventory.2023-02-13--11-07-38 finished dump eqiad db_inventory Feb. 13, 2023, 11:07 a.m. Feb. 13, 2023, 11:07 a.m. 
1s 93.0 KB [11:08:16] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [11:08:29] I will then now start bacula and run a gerrit backup too [11:08:43] !log rolling out no_proxy change https://gerrit.wikimedia.org/r/c/operations/puppet/+/879418 [11:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:47] (03CR) 10Jbond: [C: 03+2] P:environment: roll out no proxy config to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [11:09:04] jynus: good [11:09:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P44384 and previous config saved to /var/cache/conftool/dbconfig/20230213-110928-marostegui.json [11:10:25] Service[bacula-director]/ensure: ensure changed 'stopped' to 'running' [11:11:35] gerrit1001.wikimedia.org-Hourly-Sun-productionEqiad-gerrit-repo-data is running [11:11:51] 112.9 M OK [11:11:58] so all good on my side [11:12:08] (03PS1) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [11:12:17] jynus: excellent thanks [11:14:47] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install2003.wikimedia.org [11:15:03] (03PS8) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [11:15:05] (03PS1) 10Elukey: changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) [11:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:01] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [11:18:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P44385 and previous config saved to /var/cache/conftool/dbconfig/20230213-111902-marostegui.json [11:19:21] !log nfraison@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [11:19:54] PROBLEM - Check systemd state on an-presto1007 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:30] PROBLEM - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:32] godog: the alert is no longer ongoing, right? 
https://alerts.wikimedia.org/?q=team%3Dsre&q=alertname%3DJobUnavailable [11:20:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:40] jynus: that's right yeah [11:22:09] thanks, I confirmed it with grafana; I am not yet super comfortable with alertmanager ui [11:22:13] (03CR) 10Btullis: [C: 03+2] Remove the GPU configuration from an-worker109[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis) [11:22:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:22:58] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:32] (03PS2) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [11:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P44386 and previous config saved to /var/cache/conftool/dbconfig/20230213-112435-marostegui.json [11:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:26:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install2003.wikimedia.org [11:27:03] 10SRE, 
10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install2003.wikimedia.org` - install2003.wikimedia.org (**PASS**) - Down... [11:29:02] (03PS1) 10Muehlenhoff: Remove Puppet references to install[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/888656 (https://phabricator.wikimedia.org/T327867) [11:29:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [11:29:40] (03CR) 10Jbond: [C: 03+2] "lgtm thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/879854 (owner: 10Zabe) [11:29:49] (03PS5) 10ArielGlenn: use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 (https://phabricator.wikimedia.org/T328804) (owner: 10Hokwelum) [11:29:58] (03CR) 10Muehlenhoff: [C: 03+2] Remove Puppet references to install[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/888656 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [11:31:19] (03CR) 10ArielGlenn: [C: 03+2] use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 (https://phabricator.wikimedia.org/T328804) (owner: 10Hokwelum) [11:31:42] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [11:31:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install1003.wikimedia.org [11:32:19] apergos: happy for me to merge your change "use lbzip2 instead of bzcat to decompress blocks in parallel" [11:32:40] please do, I went to puppet-merge and it informed me you were the cause of the edit conflict :-P [11:32:50] well that and "w" :-P [11:33:02] jbond: [11:33:16] :) done [11:33:32] thank you! 
[11:33:36] np [11:34:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44387 and previous config saved to /var/cache/conftool/dbconfig/20230213-113408-marostegui.json [11:34:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:34:13] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:34:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T329203)', diff saved to https://phabricator.wikimedia.org/P44388 and previous config saved to /var/cache/conftool/dbconfig/20230213-113430-marostegui.json [11:35:46] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:35:53] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:37:43] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [11:37:55] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [11:38:10] (03CR) 10Jbond: [C: 03+2] "i have deployed this and seems like we at least get results on puppetmaster1001, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [11:38:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44389 and previous config saved to /var/cache/conftool/dbconfig/20230213-113821-marostegui.json [11:38:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883700 (owner: 10Ayounsi) [11:39:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:39:22] (03PS1) 10Volans: sre.deploy.python-code: actually use -u/--user [cookbooks] - 10https://gerrit.wikimedia.org/r/888658 [11:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T328817)', diff saved to https://phabricator.wikimedia.org/P44390 and previous config saved to /var/cache/conftool/dbconfig/20230213-113941-marostegui.json [11:39:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:39:45] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:39:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:40:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T328817)', diff saved to https://phabricator.wikimedia.org/P44391 and previous config saved to /var/cache/conftool/dbconfig/20230213-114002-marostegui.json [11:40:32] (03CR) 10Ayounsi: [C: 03+1] sre.deploy.python-code: actually use -u/--user [cookbooks] - 10https://gerrit.wikimedia.org/r/888658 (owner: 10Volans) [11:40:41] (03PS1) 10David Caro: wmcs ceph:Move cloudcephosd1001/1002 to e4 [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) [11:41:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: install1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:41:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:41:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install1003.wikimedia.org [11:41:10] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install1003.wikimedia.org` - install1003.wikimedia.org (**PASS**) - Downtimed host on Icinga/A... [11:41:32] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [11:42:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:42:06] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All install servers are now on Bullseye. 
[11:42:20] (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: actually use -u/--user [cookbooks] - 10https://gerrit.wikimedia.org/r/888658 (owner: 10Volans) [11:44:09] (03Merged) 10jenkins-bot: sre.deploy.python-code: actually use -u/--user [cookbooks] - 10https://gerrit.wikimedia.org/r/888658 (owner: 10Volans) [11:44:30] w/in 25 [11:44:33] lolno [11:45:06] godog: lolfingers [11:45:17] indeed [11:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:32] (03PS1) 10David Caro: wmcs.ceph: move cloudcephosd1003/1004 to e4/f4 [puppet] - 10https://gerrit.wikimedia.org/r/888660 (https://phabricator.wikimedia.org/T329502) [11:45:35] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [11:45:56] (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/888657 (owner: 10Clément Goubert) [11:47:37] PROBLEM - Check systemd state on an-presto1014 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:43] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [11:49:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: provision .kube/config in root home directory [puppet] - 10https://gerrit.wikimedia.org/r/888245 (https://phabricator.wikimedia.org/T329376) (owner: 10Majavah) [11:52:37] (03CR) 10Muehlenhoff: [C: 03+2] Add safe.directory directives for the puppet master [puppet] - 
10https://gerrit.wikimedia.org/r/888053 (owner: 10Muehlenhoff) [11:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P44392 and previous config saved to /var/cache/conftool/dbconfig/20230213-115327-marostegui.json [11:53:28] (03PS2) 10David Caro: wmcs ceph:Move cloudcephosd1001/1002 to e4 [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) [11:53:30] (03PS2) 10David Caro: wmcs.ceph: move cloudcephosd1003/1004 to e4/f4 [puppet] - 10https://gerrit.wikimedia.org/r/888660 (https://phabricator.wikimedia.org/T329502) [11:53:32] (03PS1) 10David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) [11:53:35] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] thumbor: Raise firejail's nofile limit to 8192 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888657 (owner: 10Clément Goubert) [11:54:00] (03PS1) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) [11:54:10] (03PS1) 10Volans: netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 [11:55:16] (03CR) 10Sergio Gimeno: [C: 04-1] "Waiting to align a communication plan." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [11:55:53] (03PS1) 10Majavah: Add alerts for expiring Puppet CA certificates [alerts] - 10https://gerrit.wikimedia.org/r/888667 [11:56:00] (03CR) 10CI reject: [V: 04-1] netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [11:57:56] (03CR) 10David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) (owner: 10David Caro) [11:58:43] (03PS2) 10David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) [11:58:50] (03CR) 10David Caro: wmcs.ceph: move cloudcephosd1005/1010 to f4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888663 (https://phabricator.wikimedia.org/T329504) (owner: 10David Caro) [11:58:53] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10MatthewVernon) Noting this here in case it's relevant - looking at swift 502s this morning (cf P44364) we found that a number of these er... 
[12:02:21] (03PS2) 10Volans: netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 [12:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T328817)', diff saved to https://phabricator.wikimedia.org/P44393 and previous config saved to /var/cache/conftool/dbconfig/20230213-120309-marostegui.json [12:03:14] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:04:46] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:05:53] (03CR) 10Jbond: "not sure about the actual change but have made some general comments on the python side" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [12:07:05] (03CR) 10Ayounsi: [C: 03+1] netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:07:39] !log Roll-restart thumbor in codfw - Deploying CR 888657 [12:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:09] (03CR) 10Jbond: "lgtm on nit" [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:08:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P44394 and previous config saved to /var/cache/conftool/dbconfig/20230213-120833-marostegui.json [12:08:58] oh, nice, TIL [12:09:03] * volans wrong tab [12:10:02] (03PS3) 10Volans: netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 [12:11:17] (03PS2) 10Jbond: bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 [12:11:54] (03CR) 10CI reject: [V: 04-1] netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:12:59] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10cmooney) 05Open→03Resolved I think all is complete with this request. @nfraison if I'm incorrect please re-open and I'll have a look. thanks. [12:13:08] (03CR) 10CI reject: [V: 04-1] bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:13:36] (03PS1) 10Marostegui: db1205,db2184: Migrate them to MariaDB 10.6.12 [puppet] - 10https://gerrit.wikimedia.org/r/888673 (https://phabricator.wikimedia.org/T329499) [12:14:43] (03CR) 10Marostegui: [C: 03+2] db1205,db2184: Migrate them to MariaDB 10.6.12 [puppet] - 10https://gerrit.wikimedia.org/r/888673 (https://phabricator.wikimedia.org/T329499) (owner: 10Marostegui) [12:15:16] !log Upgrade db1205 and db2184 to mariadb 10.6.12 T329499 [12:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:19] T329499: Migrate backup1-* replicas to MariaDB 10.6 - https://phabricator.wikimedia.org/T329499 [12:15:40] (03PS4) 10Volans: netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 [12:15:47] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:17:56] (03PS3) 10Jbond: bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 [12:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P44395 and previous config saved to /var/cache/conftool/dbconfig/20230213-121816-marostegui.json [12:18:35] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:18:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39520/console" [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:19:29] (03PS4) 10Jbond: bgpalerter: Pass through RPKI 
config [puppet] - 10https://gerrit.wikimedia.org/r/878877 [12:20:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but please collect +1 from somebody else, this has been an important piece of grid Toolforge in the past. Others in the team may hav" [puppet] - 10https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: 10Majavah) [12:21:10] (03CR) 10Jbond: "Going through old changes and found this. is it useful?" [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:22:02] !log Roll-restart thumbor in eqiad - Deploying CR 888657 [12:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:23] (03CR) 10Jbond: [C: 03+1] "LGTM thanks, will leave to filippo to merge" [alerts] - 10https://gerrit.wikimedia.org/r/888667 (owner: 10Majavah) [12:23:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T329203)', diff saved to https://phabricator.wikimedia.org/P44396 and previous config saved to /var/cache/conftool/dbconfig/20230213-122339-marostegui.json [12:23:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:23:43] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:23:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T329203)', diff saved to https://phabricator.wikimedia.org/P44397 and previous config saved to /var/cache/conftool/dbconfig/20230213-122401-marostegui.json [12:24:15] (03CR) 10Ayounsi: [C: 03+1] netbox: add python_deploy::venv [puppet] - 10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:24:42] (03CR) 10Volans: [C: 03+2] netbox: add python_deploy::venv [puppet] - 
10https://gerrit.wikimedia.org/r/888665 (owner: 10Volans) [12:25:04] !log thumbor roll-restarts done [12:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:20] ACKNOWLEDGEMENT - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:20] ACKNOWLEDGEMENT - Check systemd state on an-presto1007 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:20] ACKNOWLEDGEMENT - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Nicolas Fraison Presto is currently disabled on those nodes while working on T325809 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:19] (03CR) 10Ayounsi: [C: 03+1] "Now that it uses a central Redis, we can go for active/active on the frontends." [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:27:44] (03CR) 10Ayounsi: "Now that it uses a central Redis, we can go for active/active on the frontends." 
[dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T329203)', diff saved to https://phabricator.wikimedia.org/P44398 and previous config saved to /var/cache/conftool/dbconfig/20230213-122808-marostegui.json [12:28:14] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [12:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:48] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:31:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P44399 and previous config saved to /var/cache/conftool/dbconfig/20230213-123322-marostegui.json [12:33:35] (03PS2) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) [12:34:09] (03CR) 10Jbond: "rebased and now we have central redis ready for another review" [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:34:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:34:32] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) [12:34:53] (03PS3) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 
(https://phabricator.wikimedia.org/T296452) [12:35:13] (03CR) 10Jbond: "rebased and ready for another review now we are on central redis" [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:35:52] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:15] (03CR) 10Ayounsi: [C: 03+1] "It makes sense to merge it now that it's ready." [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:39:04] (03PS4) 10Ayounsi: Remove "old" VRRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) [12:40:03] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:40:32] (03PS3) 10Majavah: openstack: encapi: create parent directories for files [puppet] - 10https://gerrit.wikimedia.org/r/881711 [12:40:37] (03PS1) 10Nicolas Fraison: fix(presto): do not set query.max*per-node config on coordinator [puppet] - 10https://gerrit.wikimedia.org/r/888685 [12:40:43] (03CR) 10Majavah: openstack: encapi: create parent directories for files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881711 (owner: 10Majavah) [12:41:58] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:43:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44400 and previous config saved to /var/cache/conftool/dbconfig/20230213-124314-marostegui.json [12:43:50] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39521/console" [puppet] - 
10https://gerrit.wikimedia.org/r/888685 (owner: 10Nicolas Fraison) [12:44:39] 10SRE-tools, 10Infrastructure-Foundations, 10homer: Add CI to homer-deploy repo - https://phabricator.wikimedia.org/T277440 (10ayounsi) 05Open→03Resolved a:03ayounsi Confirmed working with https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/838171/4#message-0f67c7e34484d48fd1c17bdfdcf95... [12:47:44] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:48:22] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, nice to be removing some code from there :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) (owner: 10Ayounsi) [12:48:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T328817)', diff saved to https://phabricator.wikimedia.org/P44401 and previous config saved to /var/cache/conftool/dbconfig/20230213-124828-marostegui.json [12:48:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:48:32] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:48:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:48:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:48:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:48:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T328817)', diff saved to https://phabricator.wikimedia.org/P44402 and previous config saved to /var/cache/conftool/dbconfig/20230213-124853-marostegui.json [12:51:15] (03PS1) 10Jelto: sre.gitlab.upgrade: post boardcast message during upgrade 
[cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) [12:52:41] (03CR) 10Jbond: [C: 03+2] bgpalerter: Pass through RPKI config [puppet] - 10https://gerrit.wikimedia.org/r/878877 (owner: 10Jbond) [12:53:03] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: post boardcast message during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [12:53:26] (03PS1) 10Andrew Bogott: wmcs vm backups: don't email when the backup service flaps [puppet] - 10https://gerrit.wikimedia.org/r/888687 [12:55:09] (03PS1) 10Volans: Makefile.deploy: fix detection of CA_BUNDLE [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/888689 [12:55:23] (03PS2) 10Jelto: sre.gitlab.upgrade: post boardcast message during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) [12:55:34] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/888687 (owner: 10Andrew Bogott) [12:56:47] (03CR) 10Volans: "message nits" [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [12:56:51] (03PS1) 10EoghanGaffney: Add puppet role for aphlict vm in codfw [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) [12:57:37] (03CR) 10Andrew Bogott: [C: 03+2] wmcs vm backups: don't email when the backup service flaps [puppet] - 10https://gerrit.wikimedia.org/r/888687 (owner: 10Andrew Bogott) [12:58:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44403 and previous config saved to /var/cache/conftool/dbconfig/20230213-125821-marostegui.json [12:58:23] (03PS3) 10Jelto: sre.gitlab.upgrade: post broadcast message during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) 
[12:59:18] (03CR) 10Jelto: sre.gitlab.upgrade: post broadcast message during upgrade (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [13:00:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/888689 (owner: 10Volans) [13:00:42] (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: fix detection of CA_BUNDLE [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/888689 (owner: 10Volans) [13:01:13] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [13:01:26] (03CR) 10David Caro: [C: 03+1] "LGTM, can be improved later if needed" [puppet] - 10https://gerrit.wikimedia.org/r/888311 (https://phabricator.wikimedia.org/T329377) (owner: 10Majavah) [13:02:04] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice! This was the missing piece of information for me "how the `event` key is added before to the payload". all covered now!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [13:02:13] (03CR) 10Filippo Giunchedi: [C: 03+2] "Neat! 
Thank you Majavah" [alerts] - 10https://gerrit.wikimedia.org/r/888667 (owner: 10Majavah) [13:03:13] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39522/console" [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [13:03:15] (03CR) 10David Caro: [C: 03+2] P:wmcs::services::toolsdb_replica_cnf: don't manage the directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888311 (https://phabricator.wikimedia.org/T329377) (owner: 10Majavah) [13:03:48] (03CR) 10Andrew Bogott: [C: 03+1] P:wmcs::services::toolsdb_replica_cnf: don't manage the directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888311 (https://phabricator.wikimedia.org/T329377) (owner: 10Majavah) [13:05:33] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [13:06:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (10cmooney) @Ottomata or @odimitrijevic could I ask one of you to approve this request? Thanks. 
[13:08:59] (03PS1) 10Slyngshede: P:installserver::dhcp remove dhcp config for VMs [puppet] - 10https://gerrit.wikimedia.org/r/888692 [13:12:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T328817)', diff saved to https://phabricator.wikimedia.org/P44404 and previous config saved to /var/cache/conftool/dbconfig/20230213-131213-marostegui.json [13:12:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:13:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T329203)', diff saved to https://phabricator.wikimedia.org/P44405 and previous config saved to /var/cache/conftool/dbconfig/20230213-131327-marostegui.json [13:13:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:13:31] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:13:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:13:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44406 and previous config saved to /var/cache/conftool/dbconfig/20230213-131348-marostegui.json [13:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44407 and previous config saved to /var/cache/conftool/dbconfig/20230213-131703-marostegui.json [13:19:10] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:21:04] (03CR) 
10Klausman: [C: 03+1] changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [13:22:58] (03PS1) 10Majavah: P:puppetmaster::frontend: remove certificate NRPE check [puppet] - 10https://gerrit.wikimedia.org/r/888695 [13:27:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P44408 and previous config saved to /var/cache/conftool/dbconfig/20230213-132719-marostegui.json [13:27:32] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@15-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44409 and previous config saved to /var/cache/conftool/dbconfig/20230213-133210-marostegui.json [13:33:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:23] (03PS1) 10Filippo Giunchedi: wmnet: add logs-api svc records [dns] - 10https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) [13:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P44410 and previous config saved to /var/cache/conftool/dbconfig/20230213-134226-marostegui.json [13:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44411 and previous config saved to /var/cache/conftool/dbconfig/20230213-134716-marostegui.json [13:48:07] (03PS1) 10Filippo Giunchedi: Add logs-api service [puppet] - 10https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) [13:49:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:23] (03CR) 10Filippo Giunchedi: "For reference I've been following https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service" [puppet] - 10https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [13:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 6677 [13:54:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6677 [13:56:03] (03PS1) 10Bartosz Dziewoński: Follow-up I3412c53cc: Fix reference to target in ve.ce.MWWikitextSurface [extensions/VisualEditor] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888343 (https://phabricator.wikimedia.org/T329439) [13:56:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T328817)', diff saved to https://phabricator.wikimedia.org/P44412 and previous config saved to /var/cache/conftool/dbconfig/20230213-135732-marostegui.json [13:57:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:57:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:57:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (10Ottomata) Approved! This needs kerberos access too. [13:57:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:57:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44413 and previous config saved to /var/cache/conftool/dbconfig/20230213-135753-marostegui.json [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T1400). [14:00:05] mazevedo, Jhs, dcausse, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:11] o/ [14:00:30] I’d like to deploy at least the guwwiktionary part, since I recently had issues with namespaceDupes on another wiki [14:00:45] hi [14:00:49] but I’d probably postpone that to the end of the window and let the other deployments happen first [14:00:57] (i can do those too but if someone else wants to, feel free ^^) [14:01:49] hi [14:02:12] o/ [14:02:21] (03PS1) 10Andrew Bogott: cloud-vps/galera: standardize timeouts across haprox/mysql [puppet] - 10https://gerrit.wikimedia.org/r/888703 (https://phabricator.wikimedia.org/T328155) [14:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T329203)', diff saved to https://phabricator.wikimedia.org/P44414 and previous config saved to /var/cache/conftool/dbconfig/20230213-140222-marostegui.json [14:02:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:02:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:02:27] let’s start the gate-and-submit for MatmaRex [14:02:31] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:02:35] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:02:36] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) @Marostegui removed drive [14:02:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T329203)', diff saved to https://phabricator.wikimedia.org/P44415 and previous config saved to 
/var/cache/conftool/dbconfig/20230213-140243-marostegui.json [14:02:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit, I’ll hopefully +2 again via `scap backport` in time :)" [extensions/VisualEditor] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888343 (https://phabricator.wikimedia.org/T329439) (owner: 10Bartosz Dziewoński) [14:02:53] and then start with mazevedo [14:03:08] (03PS2) 10Andrew Bogott: cloud-vps/galera: standardize timeouts across haprox/mysql [puppet] - 10https://gerrit.wikimedia.org/r/888703 (https://phabricator.wikimedia.org/T328155) [14:03:30] i'm ready! [14:03:30] neat, I didn’t know stash supported multiple task IDs in one Bug: line [14:03:36] usually I see one line per task ^ [14:03:38] * ^^ [14:04:13] * Jhs is here [14:04:21] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps/galera: standardize timeouts across haprox/mysql [puppet] - 10https://gerrit.wikimedia.org/r/888703 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [14:04:22] mazevedo: any idea why the diffConfig build didn’t detect any changes? [14:04:30] shouldn’t it show up in wgEventStreams for all wikis? [14:04:35] i have no idea :/ [14:04:38] (03PS1) 10Jbond: postgresql::user: Check some password is set for a user [puppet] - 10https://gerrit.wikimedia.org/r/888704 [14:04:43] meh, then let’s deploy anyways [14:04:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) (owner: 10Mazevedo) [14:04:56] will this be testable on mwdebug? [14:05:02] yes! 
[14:05:06] ok [14:05:09] sounds good enough :) [14:05:23] * Lucas_WMDE waves to dcausse and Jhs too [14:05:26] (03Merged) 10jenkins-bot: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) (owner: 10Mazevedo) [14:05:38] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:887998|Add iOS stream config]] [14:05:51] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [14:05:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [14:06:49] 👋 [14:07:04] !log upload node-bgpalerter_1.31.2 to apt [14:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:19] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: fix puppet admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/888706 (https://phabricator.wikimedia.org/T327767) [14:07:23] !log lucaswerkmeister-wmde@deploy1002 mazevedo and lucaswerkmeister-wmde: Backport for [[gerrit:887998|Add iOS stream config]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:07:31] mazevedo: please test, then :) [14:07:42] on it [14:07:53] (03CR) 10Klausman: [C: 03+1] sre.k8s.upgrade-cluster: fix puppet admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/888706 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:07:54] Jhs: fyi, I’m planning to deploy your change last, since there might be issues with namespaceDupes [14:08:11] (on shnwikibooks it had issues with the linktarget table migration) [14:08:56] o_O I just noticed that diffConfig detected no config difference in dcausse’s change either [14:09:01] :/ [14:09:02] is it possible that it’s broken in general? 
[14:09:32] * Lucas_WMDE frowns [14:09:37] Lucas_WMDE all configs are working, thank you so much :) [14:09:40] shouldn’t there be *two* buildConfigCache.php commands in https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1638/console ? [14:09:42] mazevedo: ok! [14:09:45] thanks for testing [14:10:28] yeah, in the older build https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1452/consoleFull there are two buildConfigCache.php, one before the git checkout [14:10:34] that one got lost, it seems [14:10:43] !log nfraison@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:11:00] so now, in https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1638/console, the `git add -f tests/data/config-cache/` probably adds nothing to the index [14:11:00] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: fix puppet admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/888706 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:11:03] since the cache hasn’t been built yet [14:11:08] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:11:13] and then the second `buildConfigCache.php` will just produce a bunch of untracked files [14:11:14] and no git diff [14:11:18] * Lucas_WMDE looks at phabricator [14:11:42] no obvious existing task, I’ll make one [14:12:30] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on cloudcephosd1001.eqiad.wmnet with reason: moving racks [14:12:43] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:12:44] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cloudcephosd1001.eqiad.wmnet with reason: moving racks [14:12:48] !log 
dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on cloudcephosd1002.eqiad.wmnet with reason: moving racks [14:13:02] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cloudcephosd1002.eqiad.wmnet with reason: moving racks [14:13:09] (03PS2) 10Jbond: postgresql::user: Check some password is set for a user [puppet] - 10https://gerrit.wikimedia.org/r/888704 [14:13:11] (03PS1) 10Bking: wdqs data reload: clear out unused option [cookbooks] - 10https://gerrit.wikimedia.org/r/888707 (https://phabricator.wikimedia.org/T301167) [14:13:25] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [14:15:08] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Jhancock.wm) [14:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [14:15:45] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:887998|Add iOS stream config]] (duration: 10m 06s) [14:15:47] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [14:16:00] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [14:16:05] okay, let’s do MatmaRex next, the gate-and-submit is almost done [14:16:13] (03CR) 10DCausse: [C: 03+1] wdqs data reload: clear out unused option [cookbooks] - 10https://gerrit.wikimedia.org/r/888707 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [14:16:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888343 (https://phabricator.wikimedia.org/T329439) 
(owner: 10Bartosz Dziewoński) [14:16:20] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [14:16:30] thanks [14:16:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one typo. The number of iterations (4096) the test checks against appears to be a compile time (SCRAM_ITERATIONS_DEFAULT), so " [puppet] - 10https://gerrit.wikimedia.org/r/888704 (owner: 10Jbond) [14:16:38] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [14:16:42] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Jhancock.wm) a:05Jhancock.wm→03Papaul server confirmed off. older than 5 years. removed disks and server. disk in bin, server moved to storage [14:17:04] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [14:17:04] made T329518 for the diffConfig issue btw [14:17:06] T329518: diffConfig no longer detecs any changes in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T329518 [14:17:30] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:17:34] (might need some more tags, I wasn’t sure which ones to add) [14:18:21] (03CR) 10David Caro: [C: 03+2] wmcs ceph:Move cloudcephosd1001/1002 to e4 [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) (owner: 10David Caro) [14:18:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'll merge if this looks good to you John ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/888695 (owner: 10Majavah) [14:18:52] (03CR) 10David Caro: "Have to wait for the reimage to give me the new ips and recheck here xd" [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) (owner: 10David Caro) [14:19:57] (03PS3) 10Jbond: postgresql::user: Check some password is set for a user [puppet] - 10https://gerrit.wikimedia.org/r/888704 [14:20:07] (03CR) 10Jbond: postgresql::user: Check some password is set for a user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888704 (owner: 10Jbond) [14:20:26] (03PS1) 10Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) [14:20:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:45] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [14:20:53] (03CR) 10Jbond: [C: 03+2] postgresql::user: Check some password is set for a user [puppet] - 10https://gerrit.wikimedia.org/r/888704 (owner: 10Jbond) [14:20:57] (03CR) 10Bking: [C: 03+2] wdqs data reload: clear out unused option [cookbooks] - 10https://gerrit.wikimedia.org/r/888707 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [14:21:00] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [14:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44416 and previous config saved to /var/cache/conftool/dbconfig/20230213-142105-marostegui.json [14:21:05] (03PS2) 10Bking: wdqs data reload: clear out unused option [cookbooks] - 10https://gerrit.wikimedia.org/r/888707 (https://phabricator.wikimedia.org/T301167) 
[14:21:06] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [14:21:09] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:21:10] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [14:21:15] (03CR) 10Bking: [V: 03+2] wdqs data reload: clear out unused option [cookbooks] - 10https://gerrit.wikimedia.org/r/888707 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [14:21:35] the netbox dns alert is me, looking [14:21:43] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [14:21:57] !log nfraison@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:22:06] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:22:18] (03PS1) 10Muehlenhoff: Add binder to the kernel module block list [puppet] - 10https://gerrit.wikimedia.org/r/888709 [14:24:27] !log filippo@cumin1001 START - Cookbook sre.dns.netbox [14:25:04] wait, now the gate-and-submit ETA is 14 minutes [14:25:09] did I misread it earlier or did it jump up? o_O [14:25:34] :o [14:25:53] no idea, i haven't looked earlier [14:25:58] but it looks like wmf-quibble-vendor-mysql-php74-docker has just started [14:26:04] yeah [14:26:16] let’s deploy dcausse in the meantime… [14:26:19] :) [14:26:24] !log lucaswerkmeister-wmde@deploy1002 backport aborted: (duration: 10m 13s) [14:26:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [14:26:32] maybe it got restarted somehow? 
weird [14:26:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888178 (https://phabricator.wikimedia.org/T327878) (owner: 10DCausse) [14:27:02] dcausse: will that one be testable on mwdebug, or does it need a reindex or something? [14:27:15] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add logs-api VIP - filippo@cumin1001" [14:27:20] (03PS1) 10Nicolas Fraison: chore(install_server): remove an-test-presto1001 entries for reimage [puppet] - 10https://gerrit.wikimedia.org/r/888711 [14:27:30] Lucas_WMDE: I can double check the config on mwdebug yet [14:27:34] s/yet/yes [14:27:44] (03PS2) 10Nicolas Fraison: chore(install_server): remove an-test-presto1001 entries for reimage [puppet] - 10https://gerrit.wikimedia.org/r/888711 (https://phabricator.wikimedia.org/T329361) [14:27:48] ok [14:27:52] (not yet ^^) [14:28:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add logs-api VIP - filippo@cumin1001" [14:28:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:40] PROBLEM - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:28:41] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T329522 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:28:45] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T329522 (10ops-monitoring-bot) [14:28:53] (03CR) 10Volans: [C: 03+1] "I'm not familiar with gitlab's python APIs but looks sane to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 
(https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:29:25] (03CR) 10Btullis: [C: 03+1] "Great. This will permit us to test the new sre.ganeti.reimage cookbook on this node. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/888711 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [14:29:30] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T329522 (10Marostegui) 05Open→03Declined This is a test part of T325046 [14:30:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "I got a PCC failure but it doesn't seem related to this change: https://puppet-compiler.wmflabs.org/output/888695/39525/" [puppet] - 10https://gerrit.wikimedia.org/r/888695 (owner: 10Majavah) [14:30:06] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T329522 (10Marostegui) [14:30:40] hmph, the mediawiki-config gate-and-submit is still queued [14:30:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:50] ci go faster plz [14:31:01] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) The auto-generated task looks good now T329522 @Jclark-ctr can you insert the disk again whenever you've got time? Thanks! 
[14:31:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [14:31:09] (03CR) 10Nicolas Fraison: [C: 03+2] chore(install_server): remove an-test-presto1001 entries for reimage [puppet] - 10https://gerrit.wikimedia.org/r/888711 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [14:31:15] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) I will close this task once the disk is back and the RAID is back to Optimal. Thanks @MoritzMuehlenhoff for all the help [14:31:37] (03PS4) 10Jelto: sre.gitlab.upgrade: post broadcast message during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) [14:34:02] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [14:34:38] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [14:34:43] some builds look like they’re stuck on something but I can’t really tell what’s going on [14:34:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync [14:35:01] I’ll just deploy what i can of the changes that do make it through gate-and-submit, I guess [14:35:01] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync [14:35:54] (03Merged) 10jenkins-bot: [cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888178 (https://phabricator.wikimedia.org/T327878) (owner: 10DCausse) [14:36:06] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:888178|[cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki (T327878)]] [14:36:11] (03CR) 10Vgutierrez: 
"https://puppet-compiler.wmflabs.org/output/888652/39516/ looks good on production, cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cl" [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [14:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P44417 and previous config saved to /var/cache/conftool/dbconfig/20230213-143611-marostegui.json [14:36:19] T327878: Tweak Autocomplete search results on the Mongolian Wikipedia - https://phabricator.wikimedia.org/T327878 [14:37:05] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [14:37:46] !log lucaswerkmeister-wmde@deploy1002 dcausse and lucaswerkmeister-wmde: Backport for [[gerrit:888178|[cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki (T327878)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:38:00] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [14:38:00] looking ^ [14:38:03] ok! 
[14:38:33] !log nfraison@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:38:41] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:38:44] hm, I tried searching for Мөрдорж and it’s not showing any results… [14:38:52] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:07] Lucas_WMDE: it requires a reindex, but I can't see my new config option at https://mn.wikipedia.org/w/api.php?action=cirrus-config-dump&format=json&formatversion=2 [14:39:29] hmm [14:39:50] 10SRE, 10serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10cmooney) I looked at this last week, and while I didn't get to the bottom of it, my //suspicion// here is that the logs are due to a race-condition when the connection terminates. I specificall... [14:39:57] (03CR) 10Vgutierrez: acme_chief: support several passive hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) (owner: 10Vgutierrez) [14:40:13] dcausse: could that be because it’s not listed in ConfigDump::$PUBLICLY_SHAREABLE_CONFIG_VARS? [14:40:29] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: post broadcast message during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:40:30] Lucas_WMDE: oh good catch! 
[14:40:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:02] $wgCirrusSearchCompletionSuggesterUseDefaultSort is true in `mwscript shell.php mnwiki`, at least [14:41:24] ^ same [14:41:30] Lucas_WMDE: all good then :) [14:41:36] ok, then let’s sync [14:41:40] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [14:41:45] and you can do the reindex and see if it works then [14:41:56] normal search still seems to work at least, so hopefully it’s harmless [14:42:06] (I searched for the title of the featured article ^^) [14:42:07] yes, I'll run this after the window [14:42:19] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1001.eqiad.wmnet with OS bullseye [14:42:35] 10SRE, 10Traffic, 10Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (10Snwachukwu) See wikitech documentation [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues/2023-01-08_Webrequest_Data_Loss | here ]]. 
[14:42:35] !log nfraison@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:42:41] (Merged) jenkins-bot: sre.gitlab.upgrade: post broadcast message during upgrade [cookbooks] - https://gerrit.wikimedia.org/r/888686 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [14:42:43] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1001.eqiad.wmnet with OS bullseye [14:42:44] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:42:49] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [14:43:17] for a moment I thought "what?" when I've read "cloudcephosd1001.eqiad.wmnet [14:43:32] I got the same issue before, due to sudo + tmux [14:43:45] there is somebody else reimaging [14:44:06] it's dcaro :) [14:44:18] just pinged him [14:45:30] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 same version (noop) [14:45:59] elukey: that was me xd [14:46:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:06] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:46:10] dcaro: <3 :D [14:46:18] (CR) Herron: wmnet: add logs-api svc records (1 comment) [dns] - https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi) [14:46:40] !log nfraison@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-presto1001.eqiad.wmnet with OS bullseye [14:46:45] (CR) Herron: [C: +1] "LGTM pending merge of dns patch" 
[puppet] - https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi) [14:46:49] nfraison: o/ the cookbook that you are using is very new and under testing, ping slyngs to inform them that you are using it (so in case of any error you can follow up etc..) [14:47:12] (CR) Volans: [C: +1] "LGTM! Ship it!" [puppet] - https://gerrit.wikimedia.org/r/888692 (owner: Slyngshede) [14:47:15] any idea why the error '/etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf line 6: host an-test-presto1001: already exists', anyone playing with that host? [14:47:24] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:888178|[cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki (T327878)]] (duration: 11m 18s) [14:47:28] T327878: Tweak Autocomplete search results on the Mongolian Wikipedia - https://phabricator.wikimedia.org/T327878 [14:47:31] dcaro: yeah nfraison [14:47:42] Lucas_WMDE: thanks for deploy! :) [14:47:49] slyngs: ^^^^ [14:47:50] np :) [14:48:07] okay, let’s resume MatmaRex then [14:48:15] (the phpunit tests are currently at ~40%) [14:48:33] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/888343 (https://phabricator.wikimedia.org/T329439) (owner: Bartosz Dziewoński) [14:50:17] (CR) Lucas Werkmeister (WMDE): Rename project namespace in guwwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) (owner: Jon Harald Søby) [14:50:19] nfraison: please keep me updated, I'm kinda blocked by the reimage [14:50:29] dcaro: are you trying to reimage a vm? 
[14:50:37] elukey: no, a bare metal host [14:50:44] but it complains about that one I pasted [14:50:55] ack perfect, because there was https://gerrit.wikimedia.org/r/c/operations/puppet/+/888692 pending (I think this is why nfraison's run is failing) [14:51:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P44418 and previous config saved to /var/cache/conftool/dbconfig/20230213-145117-marostegui.json [14:51:35] nfraison: so https://gerrit.wikimedia.org/r/c/operations/puppet/+/888692 needs to be merged before you can run your cookbook successfully [14:51:47] elukey: yep, that seems to be it! [14:52:18] I am also blocked since my cookbook will fail as well (need to run multiple reimages) [14:52:21] elukey, nfraison either that or remove the single host from modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 before running the cookbook [14:52:45] slyngs: just merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/888692 [14:52:57] +1 yeah [14:53:03] let's do it [14:53:24] (Merged) jenkins-bot: Follow-up I3412c53cc: Fix reference to target in ve.ce.MWWikitextSurface [extensions/VisualEditor] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/888343 (https://phabricator.wikimedia.org/T329439) (owner: Bartosz Dziewoński) [14:53:26] at last \o/ [14:53:28] (CR) Elukey: [C: +1] P:installserver::dhcp remove dhcp config for VMs [puppet] - https://gerrit.wikimedia.org/r/888692 (owner: Slyngshede) [14:53:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:38] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:888343|Follow-up I3412c53cc: Fix reference to target in ve.ce.MWWikitextSurface (T329439)]] [14:53:42] T329439: "Uncaught TypeError: this.getTarget is not a 
function" when pasting in 2017 editor - https://phabricator.wikimedia.org/T329439 [14:55:15] elukey: I just need to do a rebase, just a sec. [14:55:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:888343|Follow-up I3412c53cc: Fix reference to target in ve.ce.MWWikitextSurface (T329439)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:56:00] (PS1) Jbond: postgresql::user: add trailing \' [puppet] - https://gerrit.wikimedia.org/r/888716 [14:56:02] Lucas_WMDE: looks good :) [14:56:06] volans, slyngs so https://gerrit.wikimedia.org/r/c/operations/puppet/+/888711 was merged earlier on [14:56:07] yay [14:56:09] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39527/console" [puppet] - https://gerrit.wikimedia.org/r/888695 (owner: Majavah) [14:56:19] I tried to reproduce it myself but didn’t manage in time ^^ [14:56:31] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-presto1001.eqiad.wmnet with reason: host reimage [14:56:39] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39526/console" [puppet] - https://gerrit.wikimedia.org/r/888695 (owner: Majavah) [14:56:39] * volans about to enter in a meeting [14:56:41] (CR) Majavah: [V: +1] P:puppetmaster::frontend: remove certificate NRPE check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888695 (owner: Majavah) [14:57:02] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39528/console" [puppet] - https://gerrit.wikimedia.org/r/888716 (owner: Jbond) [14:57:44] (CR) Jon Harald Søby: Rename project namespace in guwwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 
(https://phabricator.wikimedia.org/T309054) (owner: Jon Harald Søby) [14:57:48] (CR) Jbond: [V: +1 C: +2] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/888695 (owner: Majavah) [14:57:50] (PS3) Jon Harald Søby: Rename project namespace in guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) [14:58:29] (CR) Jbond: [V: +1 C: +2] postgresql::user: add trailing \' [puppet] - https://gerrit.wikimedia.org/r/888716 (owner: Jbond) [14:58:48] (CR) Lucas Werkmeister (WMDE): Rename project namespace in guwwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) (owner: Jon Harald Søby) [14:59:39] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-presto1001.eqiad.wmnet with reason: host reimage [15:00:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:27] eluky: yes my bad I haven't seen that I was needing firstly to delete the entry in puppet before deleting the ttyS0-115200/an-test-presto1001.conf file , it is now merge and applied [15:01:04] nfraison: Ah, Okay, is it working now? Or how it's progressing? 
[15:01:55] yes it is working now (applying first puppet run) [15:02:03] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:888343|Follow-up I3412c53cc: Fix reference to target in ve.ce.MWWikitextSurface (T329439)]] (duration: 08m 25s) [15:02:07] T329439: "Uncaught TypeError: this.getTarget is not a function" when pasting in 2017 editor - https://phabricator.wikimedia.org/T329439 [15:02:31] jouncebot: nowandnext [15:02:31] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [15:02:31] In 1 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T1630) [15:02:48] if nobody minds, I’d like to continue the backport+config window to deploy the last change (guwwiktionary project ns) [15:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T329203)', diff saved to https://phabricator.wikimedia.org/P44419 and previous config saved to /var/cache/conftool/dbconfig/20230213-150259-marostegui.json [15:03:03] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:03:31] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) (owner: Jon Harald Søby) [15:04:07] nfraison: Wonderful, I'll get the patch merged that remove this requirement [15:04:11] (Merged) jenkins-bot: Rename project namespace in guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054) (owner: Jon Harald Søby) [15:04:18] dcaro: it should be fine now [15:04:23] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:888200|Rename project namespace in guwwiktionary (T309054)]] [15:04:27] T309054: Create Wiktionary Gungbe - 
https://phabricator.wikimedia.org/T309054 [15:04:29] nfraison: thanks! [15:05:11] (PS1) Btullis: Try libmariadb-java with sqoop on bullseye [puppet] - https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) [15:06:02] !log lucaswerkmeister-wmde@deploy1002 jhsoby and lucaswerkmeister-wmde: Backport for [[gerrit:888200|Rename project namespace in guwwiktionary (T309054)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:06:21] Jhs: can you test on mwdebug? [15:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T328817)', diff saved to https://phabricator.wikimedia.org/P44420 and previous config saved to /var/cache/conftool/dbconfig/20230213-150623-marostegui.json [15:06:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:06:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:06:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T328817)', diff saved to https://phabricator.wikimedia.org/P44421 and previous config saved to /var/cache/conftool/dbconfig/20230213-150644-marostegui.json [15:06:58] Lucas_WMDE, works like it should (well, not perfectly until the script is run, but nothing out of the ordinary) [15:07:07] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39529/console" [puppet] - https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: Btullis) [15:07:07] okay! 
[15:07:11] syncing [15:07:14] and then we’ll see how the script goes [15:07:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:41] i suspect the script run will be in the milliseconds range, there are only like 3 affected pages [15:09:59] (CR) AOkoth: [C: +2] vrts: enable/disable daemon depending on active host [puppet] - https://gerrit.wikimedia.org/r/886914 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth) [15:12:57] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:888200|Rename project namespace in guwwiktionary (T309054)]] (duration: 08m 33s) [15:13:03] T309054: Create Wiktionary Gungbe - https://phabricator.wikimedia.org/T309054 [15:13:11] ok, let’s see how the script behaves then [15:13:46] hehe, the dry run says there’s nothing to do [15:14:32] yeah, that sounds right. it would only have something to do if there were any conflicts, but there shouldn't be [15:14:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:14:53] SRE, Gerrit, LDAP: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792 (xcollazo) [15:14:54] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php guwwiktionary --fix | tee T309054-namespaceDupes.out # T309054 [0 pages to fix, 0 were resolvable; 0 links to fix, 0 were resolvable; 0 were deleted] [15:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:03] (PS3) Zabe: stop setting checkuser actor/comment migration variables [mediawiki-config] - https://gerrit.wikimedia.org/r/886476 (https://phabricator.wikimedia.org/T233004) [15:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:21] which means no opportunity to test T328634#8593132. oh well ^^ [15:15:21] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [15:15:28] !log UTC afternoon backport+config window done [15:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:48] Lucas_WMDE, thanks! [15:17:07] (PS1) Jbond: postgresql::user: need to also include the escape in the final command [puppet] - https://gerrit.wikimedia.org/r/888719 [15:17:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P44422 and previous config saved to /var/cache/conftool/dbconfig/20230213-151805-marostegui.json [15:18:07] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39530/console" [puppet] - https://gerrit.wikimedia.org/r/888719 (owner: Jbond) [15:18:31] (CR) Jbond: [V: +1 C: +2] postgresql::user: need to also include the escape in the final command [puppet] - https://gerrit.wikimedia.org/r/888719 (owner: Jbond) [15:18:56] SRE, ops-codfw, DC-Ops, serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (Papaul) Open→Resolved [15:19:33] jouncebot: nowandnext [15:19:33] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [15:19:33] In 1 hour(s) and 10 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T1630) [15:20:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 same version (noop) [15:22:00] !log T327878: rebuilding CirrusSearch completion index on mnwiki from mwmaint1002 [15:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:04] T327878: Tweak Autocomplete search results on the Mongolian Wikipedia - https://phabricator.wikimedia.org/T327878 [15:26:21] (PS2) Muehlenhoff: Add binder to the kernel module block list [puppet] - https://gerrit.wikimedia.org/r/888709 [15:27:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2438.mgmt.codfw.wmnet with reboot policy FORCED [15:27:43] PROBLEM - Check systemd state on 
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T328817)', diff saved to https://phabricator.wikimedia.org/P44423 and previous config saved to /var/cache/conftool/dbconfig/20230213-153025-marostegui.json [15:30:29] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:32:12] (CR) Ayounsi: [C: +2] Remove single contact feature [puppet] - https://gerrit.wikimedia.org/r/883700 (owner: Ayounsi) [15:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44424 and previous config saved to /var/cache/conftool/dbconfig/20230213-153309-root.json [15:33:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [15:34:16] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [15:35:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:35:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:36:10] (CR) Andrew Bogott: [C: +1] "lgtm but please ping before merge" [puppet] - https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:38:29] !log disable puppet on A:dns-rec; merging CR 888236 [15:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:41] (CR) Ssingh: [V: +1 C: +2] dnsrecursor: update template to prepare for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/888236 
(https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:38:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:38:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:40:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:40:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:41:29] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/888709 (owner: Muehlenhoff) [15:42:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:42:59] !log restarting blazegraph on wdqs1004 (BlazegraphFreeAllocatorsDecreasingRapidly) [15:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:54] (PS1) Muehlenhoff: Remove role::spare::system [puppet] - https://gerrit.wikimedia.org/r/888722 (https://phabricator.wikimedia.org/T324475) [15:44:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:44:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:45:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P44425 and previous config saved to 
/var/cache/conftool/dbconfig/20230213-154531-marostegui.json [15:46:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [15:48:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44426 and previous config saved to /var/cache/conftool/dbconfig/20230213-154815-root.json [15:48:25] SRE, Infrastructure-Foundations, netops, User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (jbond) >>! In T238823#8597868, @jbond wrote: > > i took a quick look at one of the db server (db1136) as mysql seems to be sending a lot of theses resets (with a... [15:48:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [15:48:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [15:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T329203)', diff saved to https://phabricator.wikimedia.org/P44427 and previous config saved to /var/cache/conftool/dbconfig/20230213-154850-marostegui.json [15:48:54] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:49:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [15:49:24] SRE, serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (jbond) FYI this seems very similar to what i observed with the mysql <-> mw connections https://phabricator.wikimedia.org/T238823#8597868. For that case it seemd like the issue may only be pres... 
[15:49:52] (Abandoned) Jdrewniak: Add enwiki to desktop-improvements group [mediawiki-config] - https://gerrit.wikimedia.org/r/880531 (https://phabricator.wikimedia.org/T326892) (owner: Jdrewniak) [15:50:46] (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:51:05] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1001.eqiad.wmnet [15:51:45] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [15:52:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:52:37] (CR) DCausse: experimental: add support for custom flink-app config files (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/888231 (owner: DCausse) [15:52:42] (PS6) DCausse: flink-app: add support for custom config files [deployment-charts] - https://gerrit.wikimedia.org/r/888231 [15:52:44] (PS9) DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [15:52:59] (Abandoned) DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - https://gerrit.wikimedia.org/r/888232 (owner: DCausse) [15:53:55] SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) John has inserted the disk back and it is rebuilding: ` root@db1206:~# perccli64 /c0/e252/s2 show rebuild CLI Version = 007.1910.0000.000... [15:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T329203)', diff saved to https://phabricator.wikimedia.org/P44428 and previous config saved to /var/cache/conftool/dbconfig/20230213-155444-marostegui.json [15:54:48] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:55:04] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (ayounsi) Would it be possible to check if their DHCP requests do have option 97 set (usually to their serial#)? Thanks [15:56:20] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [15:58:03] (CR) Dzahn: "I recommend not going from "no role" to "production role" in one step. It's likely to fail at first run. 
Instead I would first put the "in" [puppet] - https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney) [15:58:16] SRE, Infrastructure-Foundations: Adapt profile::nginx to new packaging scheme introduced in Bullseye - https://phabricator.wikimedia.org/T329529 (MoritzMuehlenhoff) [15:58:59] !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1001" [15:59:41] !log nfraison@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host an-test-presto1001.eqiad.wmnet with OS bullseye [16:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P44429 and previous config saved to /var/cache/conftool/dbconfig/20230213-160037-marostegui.json [16:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:50] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge [16:01:52] SRE, Infrastructure-Foundations: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (taavi) [16:02:35] (CR) Ottomata: [C: +1] "LGTM, ty! Chart.yaml version bump, and then merge at will!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/888231 (owner: DCausse) [16:02:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1001" [16:02:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:38] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcephosd1001.eqiad.wmnet [16:02:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge [16:03:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44431 and previous config saved to /var/cache/conftool/dbconfig/20230213-160320-root.json [16:07:16] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:cloudelastic [16:07:38] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [16:09:28] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [16:09:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P44432 and previous config saved to /var/cache/conftool/dbconfig/20230213-160950-marostegui.json [16:10:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:cloudelastic [16:10:11] (PS2) Muehlenhoff: openstack::cinder::user: Switch to systemd::sysuser [puppet] - 
https://gerrit.wikimedia.org/r/881885 [16:10:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:58] PROBLEM - Check systemd state on ml-staging2002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:10] PROBLEM - Etcd cluster health on ml-staging-etcd2003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [16:13:22] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-apiserver.service,kube-controller-manager.service,kube-scheduler.service,kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:28] PROBLEM - etcd service on ml-staging-etcd2003 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:13:30] PROBLEM - Check systemd state on ml-staging2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:36] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-apiserver.service,kube-controller-manager.service,kube-scheduler.service,kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T328817)', diff saved to https://phabricator.wikimedia.org/P44433 and previous config saved to /var/cache/conftool/dbconfig/20230213-161543-marostegui.json [16:15:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [16:15:46] RECOVERY - Check 
systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:48] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:15:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [16:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T328817)', diff saved to https://phabricator.wikimedia.org/P44434 and previous config saved to /var/cache/conftool/dbconfig/20230213-161605-marostegui.json [16:17:36] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [16:18:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44435 and previous config saved to /var/cache/conftool/dbconfig/20230213-161824-root.json [16:20:51] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10Gehel) 05Open→03Resolved a:03Gehel As far as I can see, this is done for now. We might w... 
[16:22:22] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Gehel) [16:23:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2438.mgmt.codfw.wmnet with reboot policy FORCED [16:23:47] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 3 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Gehel) [16:24:42] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Priority Backlog 📥), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10thcipriani) [16:24:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P44436 and previous config saved to /var/cache/conftool/dbconfig/20230213-162456-marostegui.json [16:25:36] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [16:26:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T328817)', diff saved to https://phabricator.wikimedia.org/P44470 and previous config saved to /var/cache/conftool/dbconfig/20230213-191202-marostegui.json [19:12:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [19:15:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T329203)', diff saved to https://phabricator.wikimedia.org/P44471 and previous 
config saved to /var/cache/conftool/dbconfig/20230213-191516-marostegui.json [19:15:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:15:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T329203)', diff saved to https://phabricator.wikimedia.org/P44472 and previous config saved to /var/cache/conftool/dbconfig/20230213-191537-marostegui.json [19:15:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:26] (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [19:18:08] (03CR) 10Jdlrobson: [C: 03+1] Enable Page Tools for logged in users across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang) [19:20:10] (03CR) 10Btullis: [C: 03+1] fix(presto): create intermediate ${data_dir}/var fodler [puppet] - 10https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [19:21:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T329203)', diff saved to https://phabricator.wikimedia.org/P44473 and previous config saved to /var/cache/conftool/dbconfig/20230213-192135-marostegui.json [19:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to 
https://phabricator.wikimedia.org/P44474 and previous config saved to /var/cache/conftool/dbconfig/20230213-192709-marostegui.json [19:28:28] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1001.eqiad.wmnet with OS bullseye [19:28:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1001.eqiad.wmnet with OS bullseye [19:29:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bullseye [19:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:42] (03CR) 10CI reject: [V: 04-1] node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [19:30:46] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:30:57] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bullseye [19:31:28] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10KFrancis) @fgiunchedi The NDA is complete. Please proceed with the access request. Thanks! 
[19:33:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [19:33:42] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [19:33:45] PROBLEM - Host ml-staging2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [19:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44475 and previous config saved to /var/cache/conftool/dbconfig/20230213-193359-ladsgroup.json [19:35:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P44476 and previous config saved to /var/cache/conftool/dbconfig/20230213-193642-marostegui.json [19:38:29] RECOVERY - Host ml-staging2002 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [19:41:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44477 and previous config saved to /var/cache/conftool/dbconfig/20230213-194116-ladsgroup.json [19:42:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P44478 and previous config saved to /var/cache/conftool/dbconfig/20230213-194216-marostegui.json [19:43:23] (03CR) 10JHathaway: "I split off the adding of the jaeger charts from the code to incorporate them into the aux cluster, which allows CI to pass for the charts" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [19:44:42] (03CR) 10JHathaway: "This hopefully will pass CI, once the parent task is reviewed, but I would still love any feedback on the approach until that happens." [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [19:45:29] PROBLEM - Host ml-staging2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:01] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [19:47:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [19:47:19] RECOVERY - Host ml-staging2002 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [19:47:26] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [19:47:58] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:48:47] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1001.eqiad.wmnet with OS bullseye [19:49:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-staging-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:50:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [19:51:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P44479 and previous config saved to /var/cache/conftool/dbconfig/20230213-195148-marostegui.json [19:54:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:11] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [19:56:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44480 and previous config saved to /var/cache/conftool/dbconfig/20230213-195623-ladsgroup.json [19:56:26] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [19:56:52] (03PS2) 10DDesouza: Remove Research Incentive survey from swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252) [19:57:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T328817)', diff saved to https://phabricator.wikimedia.org/P44481 and previous config saved to /var/cache/conftool/dbconfig/20230213-195722-marostegui.json [19:57:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [19:57:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:57:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [19:57:43] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1111 (T328817)', diff saved to https://phabricator.wikimedia.org/P44482 and previous config saved to /var/cache/conftool/dbconfig/20230213-195743-marostegui.json [19:59:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:59:09] RECOVERY - Check systemd state on ml-staging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:59] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:06:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T329203)', diff saved to https://phabricator.wikimedia.org/P44483 and previous config saved to /var/cache/conftool/dbconfig/20230213-200654-marostegui.json [20:06:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:07:00] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:07:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:07:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [20:07:36] !log marostegui@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [20:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T329203)', diff saved to https://phabricator.wikimedia.org/P44484 and previous config saved to /var/cache/conftool/dbconfig/20230213-200742-marostegui.json [20:08:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44485 and previous config saved to /var/cache/conftool/dbconfig/20230213-201129-ladsgroup.json [20:12:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bullseye [20:12:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [20:13:01] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [20:13:05] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) p:05Medium→03Low Okay, I went through and updated non-user, non-talk and non-archive pages. @F... 
[20:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T329203)', diff saved to https://phabricator.wikimedia.org/P44486 and previous config saved to /var/cache/conftool/dbconfig/20230213-201336-marostegui.json [20:13:40] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:13:40] (03PS1) 10Bartosz Dziewoński: ReplyLinksController: Fix teardown failing when reloading [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888768 (https://phabricator.wikimedia.org/T329523) [20:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:21:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T328817)', diff saved to https://phabricator.wikimedia.org/P44487 and previous config saved to /var/cache/conftool/dbconfig/20230213-202157-marostegui.json [20:22:02] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:22:58] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:23:53] (03CR) 10Ottomata: "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/888752 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:24:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [20:26:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44488 and previous config saved to /var/cache/conftool/dbconfig/20230213-202635-ladsgroup.json [20:26:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [20:26:46] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:26:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [20:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44489 and previous config saved to /var/cache/conftool/dbconfig/20230213-202656-ladsgroup.json [20:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P44490 and previous config saved to /var/cache/conftool/dbconfig/20230213-202842-marostegui.json [20:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:09] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [20:32:57] !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly) [20:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to 
https://phabricator.wikimedia.org/P44491 and previous config saved to /var/cache/conftool/dbconfig/20230213-203413-ladsgroup.json [20:34:20] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:35:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:34] (03CR) 10Ottomata: [C: 03+2] Enable mediawiki.page_change on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888752 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P44492 and previous config saved to /var/cache/conftool/dbconfig/20230213-203704-marostegui.json [20:37:14] (03Merged) 10jenkins-bot: Enable mediawiki.page_change on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888752 (https://phabricator.wikimedia.org/T325305) (owner: 10Ottomata) [20:39:11] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [20:43:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P44493 and previous config saved to /var/cache/conftool/dbconfig/20230213-204348-marostegui.json [20:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:46:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:13] (03PS4) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [20:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44494 and previous config saved to /var/cache/conftool/dbconfig/20230213-204920-ladsgroup.json [20:49:42] (03CR) 10CI reject: [V: 04-1] rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [20:50:48] (03PS5) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [20:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P44495 and previous config saved to /var/cache/conftool/dbconfig/20230213-205211-marostegui.json [20:56:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:04] (03PS30) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [20:57:26] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [20:58:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T329203)', diff saved to https://phabricator.wikimedia.org/P44496 and previous config saved to /var/cache/conftool/dbconfig/20230213-205855-marostegui.json [20:58:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 
on db2168.codfw.wmnet with reason: Maintenance [20:58:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [20:58:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:59:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44497 and previous config saved to /var/cache/conftool/dbconfig/20230213-205905-marostegui.json [20:59:15] (03CR) 10Sbailey: Enable Linter migration scripts for namespace and tag and template (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888111 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T2100). [21:00:05] cirno, RoanKattouw, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:18] (03PS6) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [21:00:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:54] hi [21:01:11] I'm here [21:01:46] I can deploy if needed but I do have a meeting in 30 mins so if someone else is available that would be great [21:02:09] o/ [21:04:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44498 and previous config saved to /var/cache/conftool/dbconfig/20230213-210426-ladsgroup.json [21:05:08] (03PS1) 10Dzahn: simplelamp2: change default mariadb datadir to /var/lib/mysql/ [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) [21:05:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44499 and previous config saved to /var/cache/conftool/dbconfig/20230213-210513-marostegui.json [21:05:18] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:05:42] I can deploy [21:06:28] (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888768 (https://phabricator.wikimedia.org/T329523) (owner: 10Bartosz Dziewoński) [21:07:14] thanks [21:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T328817)', diff saved to https://phabricator.wikimedia.org/P44500 and previous config saved to /var/cache/conftool/dbconfig/20230213-210717-marostegui.json [21:07:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [21:07:21] T328817: Drop cuc_user and 
cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:07:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [21:07:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T328817)', diff saved to https://phabricator.wikimedia.org/P44501 and previous config saved to /var/cache/conftool/dbconfig/20230213-210738-marostegui.json [21:08:48] cirno: I'd like to have urbanecm review https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884378/ given the comment on the task [21:08:58] * urbanecm waves [21:09:23] (03PS2) 10Urbanecm: lmowiktionary: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340) (owner: 10Stang) [21:09:32] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340) (owner: 10Stang) [21:09:46] RoanKattouw: I think https://gerrit.wikimedia.org/r/c/884361/ is going to need an i18n rebuild, and I don't think we have time for that before your meeting [21:09:53] thanks urbanecm [21:10:09] let's go ahead. i still feel it's excessive granularity of rights, but well, what can we do. 
[21:10:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340) (owner: 10Stang) [21:10:17] Ooh that's right, I forgot about that [21:10:34] I'll reschedule for next week [21:10:53] sounds good, thank you and sorry [21:10:55] (03Merged) 10jenkins-bot: lmowiktionary: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340) (owner: 10Stang) [21:10:58] thanks for your advice urbanecm [21:11:06] No worries! [21:11:11] np [21:11:52] ottomata: hi! https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/888752 was merged but never deployed. are you around to test it now, or should I revert it? [21:12:05] (03Merged) 10jenkins-bot: ReplyLinksController: Fix teardown failing when reloading [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/888768 (https://phabricator.wikimedia.org/T329523) (owner: 10Bartosz Dziewoński) [21:12:22] (03CR) 10Urbanecm: [C: 03+1] "LGTM. Risk of medium was accepted by a WMF director (T328163#8586171), which is enough for a beta deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [21:12:57] RoanKattouw: and please split the config vars to a separate patch from the one that adds it to extension-list [21:13:58] is that needed these days, when we do sync-world on every patch anyway? 
[21:14:13] (KubernetesRsyslogDown) firing: (3) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:14:40] (03PS1) 10Majavah: Revert "Enable mediawiki.page_change on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888769 [21:15:11] (03PS2) 10Majavah: Revert "Enable mediawiki.page_change on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888769 [21:15:32] (03CR) 10Majavah: [C: 03+2] "merging to unblock other deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888769 (owner: 10Majavah) [21:15:57] !log taavi@deploy1002 Backport cancelled. [21:16:24] (03Merged) 10jenkins-bot: Revert "Enable mediawiki.page_change on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888769 (owner: 10Majavah) [21:16:55] !log taavi@deploy1002 Started scap: lmowiktionary: Create extendedmover group (T327340) [21:16:59] T327340: Request extended mover user right at lmo.wiktionary.org - https://phabricator.wikimedia.org/T327340 [21:18:40] !log taavi@deploy1002 taavi: lmowiktionary: Create extendedmover group (T327340) synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:18:45] cirno: please test that config patch [21:19:03] taavi, checked Special:Listgrouprights and LGTM [21:19:10] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:19:12] thanks, syncing [21:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44502 and previous config saved to 
/var/cache/conftool/dbconfig/20230213-211932-ladsgroup.json [21:19:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:19:35] (03CR) 10Dzahn: "The main review question is.. is there any benefit to using /srv/ on a cloud VPS rather than /var/? Does that increase the chance that it'" [puppet] - 10https://gerrit.wikimedia.org/r/888800 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [21:19:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:19:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:20:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44503 and previous config saved to /var/cache/conftool/dbconfig/20230213-212020-marostegui.json [21:25:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [21:25:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [21:25:24] !log taavi@deploy1002 Finished scap: lmowiktionary: Create extendedmover group (T327340) (duration: 08m 28s) [21:25:29] T327340: Request extended mover user right at lmo.wiktionary.org - https://phabricator.wikimedia.org/T327340 [21:25:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44504 and previous config saved to /var/cache/conftool/dbconfig/20230213-212529-ladsgroup.json [21:25:33] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:25:38] cirno: deployed! 
[21:25:38] (03PS31) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [21:25:45] ty! [21:25:46] MatmaRex: yours is next [21:25:57] cool [21:26:00] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [21:26:19] !log taavi@deploy1002 Started scap: Backport for [[gerrit:888768|ReplyLinksController: Fix teardown failing when reloading (T329523)]] [21:26:22] T329523: Reply tool "Show 1 new comment" button causes the tool to disappear, doesn't show new comments - https://phabricator.wikimedia.org/T329523 [21:26:33] (03CR) 10Herron: "here's a current pcc https://puppet-compiler.wmflabs.org/output/888065/39549" [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [21:27:56] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:888768|ReplyLinksController: Fix teardown failing when reloading (T329523)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:28:03] taavi: hi sorry yes! [21:28:14] i got pulled away right after I hit +2 and was waiting for jenkins [21:28:28] sorry [21:28:33] MatmaRex: please test [21:28:41] that change is safe to go out [21:28:59] ottomata: I reverted it already to unblock others, we can ship an un-revert after this one [21:29:03] k [21:29:11] sorry about that [21:29:17] taavi: thanks, looks good [21:29:22] syncing [21:29:24] ottomata: no worries!
[21:29:27] i was waiting all day for jenkins :) then got a little sidetracked last minute [21:29:44] (03PS1) 10Majavah: Revert "Revert "Enable mediawiki.page_change on group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888770 [21:29:52] (03CR) 10Majavah: [C: 03+2] Revert "Revert "Enable mediawiki.page_change on group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888770 (owner: 10Majavah) [21:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:38] (03Merged) 10jenkins-bot: Revert "Revert "Enable mediawiki.page_change on group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888770 (owner: 10Majavah) [21:32:10] (03PS1) 10Andrew Bogott: Revert "cloud-vps/galera: standardize timeouts across haprox/mysql" [puppet] - 10https://gerrit.wikimedia.org/r/888771 [21:32:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44505 and previous config saved to /var/cache/conftool/dbconfig/20230213-213256-ladsgroup.json [21:33:00] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:33:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:50] ottomata: can your patch be tested on a mwdebug host?
[21:34:57] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888768|ReplyLinksController: Fix teardown failing when reloading (T329523)]] (duration: 08m 38s) [21:35:01] T329523: Reply tool "Show 1 new comment" button causes the tool to disappear, doesn't show new comments - https://phabricator.wikimedia.org/T329523 [21:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44506 and previous config saved to /var/cache/conftool/dbconfig/20230213-213526-marostegui.json [21:35:30] !log taavi@deploy1002 Started scap: Backport for [[gerrit:888770|Revert "Revert "Enable mediawiki.page_change on group1 wikis""]] [21:35:59] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloud-vps/galera: standardize timeouts across haprox/mysql" [puppet] - 10https://gerrit.wikimedia.org/r/888771 (owner: 10Andrew Bogott) [21:36:04] hmm taavi i think so [21:36:09] lets see what is a group1 wiki... [21:36:19] metawiki is for example [21:36:21] or test2 [21:36:22] perfect [21:36:33] give me one minute, pulling it to mwdebug [21:37:15] !log taavi@deploy1002 taavi: Backport for [[gerrit:888770|Revert "Revert "Enable mediawiki.page_change on group1 wikis""]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:37:25] ottomata: now it's available [21:38:10] testing [21:38:41] taavi: it works! [21:38:44] proceed thank you [21:38:51] thanks! 
will do [21:39:48] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@5edcd7b]: deploying section_topics v0.5.0 on platform_eng Airflow instance [21:40:06] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@5edcd7b]: deploying section_topics v0.5.0 on platform_eng Airflow instance (duration: 00m 17s) [21:42:56] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1002.eqiad.wmnet'] [21:44:31] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888770|Revert "Revert "Enable mediawiki.page_change on group1 wikis""]] (duration: 09m 00s) [21:44:36] ottomata: all done! [21:45:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:49] (03PS32) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [21:48:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44507 and previous config saved to /var/cache/conftool/dbconfig/20230213-214802-ladsgroup.json [21:50:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44508 and previous config saved to /var/cache/conftool/dbconfig/20230213-215034-marostegui.json [21:50:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:50:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:50:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance 
[21:50:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44509 and previous config saved to /var/cache/conftool/dbconfig/20230213-215055-marostegui.json [21:51:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1002.eqiad.wmnet'] [21:53:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:48] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:54:13] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:55:35] (03PS1) 10Esanders: Enable history page visual diffs every except Wikipedias and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888804 (https://phabricator.wikimedia.org/T314588) [21:56:26] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [21:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44510 and previous config saved to /var/cache/conftool/dbconfig/20230213-215701-marostegui.json [21:57:05] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:57:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [21:57:30] !log 
cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:59:51] (03Abandoned) 10Jforrester: wgDiscussionToolsABTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759256 (owner: 10Esanders) [22:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230213T2200). [22:01:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44511 and previous config saved to /var/cache/conftool/dbconfig/20230213-220308-ladsgroup.json [22:06:56] (03PS2) 10Esanders: Enable history page visual diffs everywhere except Wikipedias and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888804 (https://phabricator.wikimedia.org/T314588) [22:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T328817)', diff saved to https://phabricator.wikimedia.org/P44512 and previous config saved to /var/cache/conftool/dbconfig/20230213-220753-marostegui.json [22:07:57] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [22:08:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:32] (03PS1) 10Papaul: Add new mw node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/888806 (https://phabricator.wikimedia.org/T326362) [22:10:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:11:32] (03CR) 10Papaul: [C: 03+2] Add new mw node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/888806 (https://phabricator.wikimedia.org/T326362) (owner: 10Papaul) [22:12:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44513 and previous config saved to /var/cache/conftool/dbconfig/20230213-221207-marostegui.json [22:13:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2436.codfw.wmnet with OS buster [22:14:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2436.codfw.wmnet with OS buster [22:16:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44514 and previous config saved to /var/cache/conftool/dbconfig/20230213-221815-ladsgroup.json [22:18:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:18:19] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:18:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:18:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [22:18:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: 
Maintenance [22:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44515 and previous config saved to /var/cache/conftool/dbconfig/20230213-221840-ladsgroup.json [22:19:36] taavi: thank you! [22:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P44516 and previous config saved to /var/cache/conftool/dbconfig/20230213-222300-marostegui.json [22:23:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:39] (03PS1) 10Zabe: beta: Depool deployment-db10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888807 (https://phabricator.wikimedia.org/T329577) [22:25:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2437.codfw.wmnet with OS buster [22:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44517 and previous config saved to /var/cache/conftool/dbconfig/20230213-222556-ladsgroup.json [22:26:01] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:26:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2437.codfw.wmnet with OS buster [22:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44518 and previous config saved to /var/cache/conftool/dbconfig/20230213-222713-marostegui.json [22:27:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['mc-gp2002'] [22:28:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888807 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [22:28:41] (03PS33) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [22:28:46] (03Merged) 10jenkins-bot: beta: Depool deployment-db10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888807 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [22:30:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2436.codfw.wmnet with reason: host reimage [22:35:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:39] (03PS1) 10Dzahn: devtools: change gerrit hostname to use wmcloud, not wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) [22:36:42] !log upgrading firmware on mc-gp2002 [22:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2436.codfw.wmnet with reason: host reimage [22:38:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P44519 and previous config saved to /var/cache/conftool/dbconfig/20230213-223806-marostegui.json [22:39:18] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:32] (03CR) 10Aklapper: "Ping - can 
someone with sufficient permissions please abandon this patch? Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/748734 (owner: 10Varac) [22:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44520 and previous config saved to /var/cache/conftool/dbconfig/20230213-224102-ladsgroup.json [22:41:57] (03Abandoned) 10Ahmon Dancy: Also support older k8s versions <=1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/748734 (owner: 10Varac) [22:42:13] (03Abandoned) 10Aklapper: Redirect phabricator.wikimedia.org/r/ to gerrit.wikimedia.org/g/ [puppet] - 10https://gerrit.wikimedia.org/r/863229 (https://phabricator.wikimedia.org/T324311) (owner: 10Aklapper) [22:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44521 and previous config saved to /var/cache/conftool/dbconfig/20230213-224219-marostegui.json [22:42:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:42:23] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [22:42:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:42:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T329203)', diff saved to https://phabricator.wikimedia.org/P44522 and previous config saved to /var/cache/conftool/dbconfig/20230213-224240-marostegui.json [22:44:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2002'] [22:45:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2437.codfw.wmnet with reason: host reimage [22:45:22] !log pt1979@cumin2002 
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2002'] [22:45:30] RECOVERY - Host mc-gp2002 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [22:45:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2437.codfw.wmnet with reason: host reimage [22:48:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T329203)', diff saved to https://phabricator.wikimedia.org/P44523 and previous config saved to /var/cache/conftool/dbconfig/20230213-224837-marostegui.json [22:48:41] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [22:49:28] (03PS4) 10Zabe: stop setting checkuser actor/comment migration variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886476 (https://phabricator.wikimedia.org/T233004) [22:49:30] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:50:32] (03CR) 10Zabe: [C: 03+2] stop setting checkuser actor/comment migration variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886476 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:51:23] (03Merged) 10jenkins-bot: stop setting checkuser actor/comment migration variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886476 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:51:53] !log zabe@deploy1002 Started scap: Backport for [[gerrit:886476|stop setting checkuser actor/comment migration variables (T233004)]] [22:51:57] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:52:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp2002'] 
[22:52:40] RECOVERY - Host mc-gp2002 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [22:53:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:53:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T328817)', diff saved to https://phabricator.wikimedia.org/P44524 and previous config saved to /var/cache/conftool/dbconfig/20230213-225312-marostegui.json [22:53:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [22:53:16] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [22:53:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [22:53:34] !log zabe@deploy1002 zabe: Backport for [[gerrit:886476|stop setting checkuser actor/comment migration variables (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:54:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44525 and previous config saved to /var/cache/conftool/dbconfig/20230213-225610-ladsgroup.json [22:59:40] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886476|stop setting checkuser actor/comment migration variables (T233004)]] (duration: 07m 46s) [22:59:43] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P44526 and previous config saved to /var/cache/conftool/dbconfig/20230213-230343-marostegui.json [23:04:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2002'] [23:04:43] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:06:02] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [23:10:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mc-gp2002'] [23:10:26] RECOVERY - Host mc-gp2002 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [23:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44527 and previous config saved to /var/cache/conftool/dbconfig/20230213-231116-ladsgroup.json [23:11:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [23:11:20] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:11:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [23:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44528 and previous config saved to /var/cache/conftool/dbconfig/20230213-231137-ladsgroup.json [23:13:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [23:13:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [23:14:03] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1126 (T328817)', diff saved to https://phabricator.wikimedia.org/P44529 and previous config saved to /var/cache/conftool/dbconfig/20230213-231402-marostegui.json [23:14:06] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [23:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:18:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2437.codfw.wmnet with OS buster [23:18:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:18:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2436.codfw.wmnet with OS buster [23:18:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2437.codfw.wmnet with OS buster completed: - mw2437 (**PASS**) - Removed from Pupp... [23:18:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2436.codfw.wmnet with OS buster completed: - mw2436 (**PASS**) - Removed from Pupp... 
[23:18:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P44530 and previous config saved to /var/cache/conftool/dbconfig/20230213-231850-marostegui.json [23:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44531 and previous config saved to /var/cache/conftool/dbconfig/20230213-231900-ladsgroup.json [23:19:04] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:23:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:46] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:33:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T329203)', diff saved to https://phabricator.wikimedia.org/P44532 and previous config saved to /var/cache/conftool/dbconfig/20230213-233356-marostegui.json [23:33:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [23:34:00] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [23:34:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [23:34:04] 
(ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44533 and previous config saved to /var/cache/conftool/dbconfig/20230213-233406-ladsgroup.json [23:34:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44534 and previous config saved to /var/cache/conftool/dbconfig/20230213-233407-marostegui.json [23:35:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2438.codfw.wmnet with OS buster [23:35:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2438.codfw.wmnet with OS buster [23:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T328817)', diff saved to https://phabricator.wikimedia.org/P44535 and previous config saved to /var/cache/conftool/dbconfig/20230213-233617-marostegui.json [23:36:21] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [23:36:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2439.codfw.wmnet with OS buster [23:36:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - 
https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2439.codfw.wmnet with OS buster [23:39:04] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44536 and previous config saved to /var/cache/conftool/dbconfig/20230213-234040-marostegui.json [23:40:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2003'] [23:40:44] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [23:44:52] (PS1) Dzahn: alertmanager: create mapping for serviceops-collab task-only alerts [puppet] - https://gerrit.wikimedia.org/r/888812 (https://phabricator.wikimedia.org/T329587) [23:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:27] !log upgrading firmware on mc-gp2003 [23:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:35] (PS1) Dzahn: serviceops-collab: switch alert severity to 'task' globally [puppet] - https://gerrit.wikimedia.org/r/888813 (https://phabricator.wikimedia.org/T329587) [23:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44537 and previous config saved to /var/cache/conftool/dbconfig/20230213-234912-ladsgroup.json [23:51:16] PROBLEM - Host 
mc-gp2003 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P44538 and previous config saved to /var/cache/conftool/dbconfig/20230213-235123-marostegui.json [23:52:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2438.codfw.wmnet with reason: host reimage [23:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P44539 and previous config saved to /var/cache/conftool/dbconfig/20230213-235546-marostegui.json [23:56:26] RECOVERY - Host mc-gp2003 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [23:56:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp2003'] [23:56:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2439.codfw.wmnet with reason: host reimage [23:57:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp2003'] [23:58:02] (CR) Dzahn: [C: +2] alertmanager: create mapping for serviceops-collab task-only alerts [puppet] - https://gerrit.wikimedia.org/r/888812 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn) [23:58:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2438.codfw.wmnet with reason: host reimage